Computation and Language 101
☆ VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Chenyu Zhou, Mengdan Zhang, Peixian Chen, Chaoyou Fu, Yunhang Shen, Xiawu Zheng, Xing Sun, Rongrong Ji
The swift progress of Multi-modal Large Language Models (MLLMs) has showcased their
impressive ability to tackle tasks blending vision and language. Yet, most
current models and benchmarks cater to scenarios with a narrow scope of visual
and textual contexts. These models often fall short when faced with complex
comprehension tasks, which involve navigating through a plethora of irrelevant
and potentially misleading information in both text and image forms. To bridge
this gap, we introduce a new, more demanding task known as Interleaved
Image-Text Comprehension (IITC). This task challenges models to discern and
disregard superfluous elements in both images and text to accurately answer
questions and to follow intricate instructions to pinpoint the relevant image.
In support of this task, we further craft a new VEGA dataset, tailored for the
IITC task on scientific content, and devise a subtask, Image-Text Association
(ITA), to refine image-text correlation skills. Our evaluation of four leading
closed-source models, as well as various open-source models using VEGA,
underscores the rigorous nature of IITC. Even the most advanced models, such as
Gemini-1.5-pro and GPT4V, only achieved modest success. By employing a
multi-task, multi-scale post-training strategy, we have set a robust baseline
for MLLMs on the IITC task, attaining an $85.8\%$ accuracy rate in image
association and a $0.508$ Rouge score. These results validate the effectiveness
of our dataset in improving MLLMs' capabilities for nuanced image-text
comprehension.
comment: Project Page: https://zhourax.github.io/VEGA/
☆ Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding
Recent advances in vision-language models have significantly propelled video
understanding. Existing datasets and tasks, however, have notable limitations.
Most datasets are confined to short videos with limited events and narrow
narratives. For example, datasets with instructional and egocentric videos
often document the activities of one person in a single scene. Although some
movie datasets offer richer content, they are often limited to short-term
tasks, lack publicly available videos and frequently encounter data leakage
given the use of movie forums and other resources in LLM training. To address
the above limitations, we propose the Short Film Dataset (SFD) with 1,078
publicly available amateur movies, a wide variety of genres and minimal data
leakage issues. SFD offers long-term story-oriented video tasks in the form of
multiple-choice and open-ended question answering. Our extensive experiments
emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find
strong signals in movie transcripts that allow LLMs to perform on par with
people; when relying on vision data alone, however, current models perform
significantly worse than people.
☆ Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Reward models trained on human preference data have been proven to be
effective for aligning Large Language Models (LLMs) with human intent within
the reinforcement learning from human feedback (RLHF) framework. However, the
generalization capabilities of current reward models to unseen prompts and
responses are limited. This limitation can lead to an unexpected phenomenon
known as reward over-optimization, where excessive optimization of rewards
results in a decline in actual performance. While previous research has
advocated for constraining policy optimization, our study proposes a novel
approach to enhance the reward model's generalization ability against
distribution shifts by regularizing the hidden states. Specifically, we retain
the base model's language model head and incorporate a suite of text-generation
losses to preserve the hidden states' text generation capabilities, while
concurrently learning a reward head behind the same hidden states. Our
experimental results demonstrate that the introduced regularization technique
markedly improves the accuracy of learned reward models across a variety of
out-of-distribution (OOD) tasks and effectively alleviates the over-optimization
issue in RLHF, offering a more reliable and robust preference learning
paradigm.
comment: 21 pages
☆ DevBench: A multimodal developmental benchmark for language learning
Alvin Wei Ming Tan, Sunny Yu, Bria Long, Wanjing Anya Ma, Tonya Murray, Rebecca D. Silverman, Jason D. Yeatman, Michael C. Frank
How (dis)similar are the learning trajectories of vision-language models and
children? Recent modeling work has attempted to understand the gap between
models' and humans' data efficiency by constructing models trained on less
data, especially multimodal naturalistic data. However, such models are often
evaluated on adult-level benchmarks, with limited breadth in language abilities
tested, and without direct comparison to behavioral data. We introduce
DevBench, a multimodal benchmark comprising seven language evaluation tasks
spanning the domains of lexical, syntactic, and semantic ability, with
behavioral data from both children and adults. We evaluate a set of
vision-language models on these tasks, comparing models and humans not only on
accuracy but on their response patterns. Across tasks, models exhibit variation
in their closeness to human response patterns, and models that perform better
on a task also more closely resemble human behavioral responses. We also
examine the developmental trajectory of OpenCLIP over training, finding that
greater training results in closer approximations to adult response patterns.
DevBench thus provides a benchmark for comparing models to human language
development. These comparisons highlight ways in which model and human language
learning processes diverge, providing insight into entry points for improving
language models.
☆ Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs
Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein
Large language models can memorize and repeat their training data, causing
privacy and copyright risks. To mitigate memorization, we introduce a subtle
modification to the next-token training objective that we call the goldfish
loss. During training, a randomly sampled subset of tokens is excluded from
the loss computation. These dropped tokens are not memorized by the model,
which prevents verbatim reproduction of a complete chain of tokens from the
training set. We run extensive experiments training billion-scale Llama-2
models, both pre-trained and trained from scratch, and demonstrate significant
reductions in extractable memorization with little to no impact on downstream
benchmarks.
comment: 9.5 pages, 8 figures, and 1 table in the main body. Code available at
https://github.com/ahans30/goldfish-loss
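The per-token masking idea behind the goldfish loss is simple enough to sketch. Below is a minimal, hedged illustration in plain Python; the function name `goldfish_loss`, the `drop_prob` parameter, and the Bernoulli-style mask are illustrative choices, and the paper's actual masking scheme (applied to a next-token cross-entropy over model logits, possibly with a deterministic rather than random mask) may differ.

```python
import random

def goldfish_loss(per_token_losses, drop_prob, rng):
    """Average the next-token losses over a random subset of positions.

    Dropped positions contribute nothing to the gradient, so the model is
    never trained to reproduce those exact tokens -- breaking verbatim
    memorization of a complete chain of tokens from the training set.
    """
    kept = [loss for loss in per_token_losses if rng.random() >= drop_prob]
    if not kept:  # degenerate case: every position was dropped
        return 0.0
    return sum(kept) / len(kept)

# with drop_prob=0 this reduces to the ordinary mean next-token loss
assert goldfish_loss([2.0, 4.0], drop_prob=0.0, rng=random.Random(0)) == 3.0
```

In a real training loop the mask would be applied to the per-position cross-entropy before reduction (for instance via an ignore index), leaving the forward pass unchanged.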
☆ A Fundamental Trade-off in Aligned Language Models and its Relation to Sampling Adaptors
The relationship between the quality of a string and its probability
$p(\boldsymbol{y})$ under a language model has been influential in the
development of techniques to build good text generation systems. For example,
several decoding algorithms have been motivated to manipulate
$p(\boldsymbol{y})$ to produce higher-quality text. In this work, we examine
the probability--quality relationship in language models explicitly aligned to
human preferences, e.g., through Reinforcement Learning from Human Feedback
(RLHF). We find that, given a general language model and its aligned version,
for corpora sampled from an aligned language model, there exists a trade-off
between the average reward and average log-likelihood of the strings under the
general language model. We provide a formal treatment of this issue and
demonstrate how the choice of sampling adaptor lets one select how much
likelihood to exchange for reward.
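The claimed trade-off can be reproduced in a toy setting. The sketch below is an illustration under assumed constructions, not the paper's formal treatment: it posits an aligned model $\pi(y) \propto p(y)\exp(r(y)/\beta)$ over a three-string "corpus" and uses temperature as the sampling adaptor; all names (`p`, `r`, `stats`) are invented for the example.

```python
import math

# Toy aligned model: pi(y) proportional to p(y) * exp(r(y) / beta).
# A temperature adaptor reshapes pi before sampling: pi_t(y) ~ pi(y)**(1/t).
p = {"good": 0.2, "ok": 0.5, "bad": 0.3}   # base LM probabilities
r = {"good": 2.0, "ok": 1.0, "bad": 0.0}   # reward per string
beta = 1.0

def normalize(w):
    z = sum(w.values())
    return {y: v / z for y, v in w.items()}

pi = normalize({y: p[y] * math.exp(r[y] / beta) for y in p})

def stats(t):
    """Average reward and average base log-likelihood at temperature t."""
    pi_t = normalize({y: pi[y] ** (1.0 / t) for y in pi})
    avg_reward = sum(pi_t[y] * r[y] for y in pi)
    avg_loglik = sum(pi_t[y] * math.log(p[y]) for y in pi)
    return avg_reward, avg_loglik

# Sharpening toward high-reward strings raises average reward but lowers
# average log-likelihood under the general (base) model -- the trade-off.
r_lo, ll_lo = stats(1.0)
r_hi, ll_hi = stats(0.3)
assert r_hi > r_lo and ll_hi < ll_lo
```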
☆ CHIRON: Rich Character Representations in Long-Form Narratives
Characters are integral to long-form narratives, but are poorly understood by
existing story analysis and generation systems. While prior work has simplified
characters via graph-based methods and brief character descriptions, we aim to
better tackle the problem of representing complex characters by taking
inspiration from advice given to professional writers. We propose CHIRON, a new
`character sheet' based representation that organizes and filters textual
information about characters. We construct CHIRON sheets in two steps: a
Generation Module that prompts an LLM for character information via
question-answering and a Validation Module that uses automated reasoning and a
domain-specific entailment model to eliminate false facts about a character. We
validate CHIRON via the downstream task of masked-character prediction, where
our experiments show CHIRON is better and more flexible than comparable
summary-based baselines. We also show that metrics derived from CHIRON can be
used to automatically infer character-centricity in stories, and that these
metrics align with human judgments.
☆ Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation INTERSPEECH 2024
Automatic speech recognition (ASR) systems often falter while processing
stuttering-related disfluencies -- such as involuntary blocks and word
repetitions -- yielding inaccurate transcripts. A critical barrier to progress
is the scarcity of large, annotated disfluent speech datasets. Therefore, we
present an inclusive ASR design approach, leveraging large-scale
self-supervised learning on standard speech followed by targeted fine-tuning
and data augmentation on a smaller, curated dataset of disfluent speech. Our
data augmentation technique enriches training datasets with various
disfluencies, enhancing ASR processing of these speech patterns. Results show
that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset,
alongside data augmentation, can significantly reduce word error rates for
disfluent speech. Our approach not only advances ASR inclusivity for people who
stutter, but also paves the way for ASRs that can accommodate wider speech
variations.
comment: Accepted to INTERSPEECH 2024
☆ Let the Poem Hit the Rhythm: Using a Byte-Based Transformer for Beat-Aligned Poetry Generation
The intersection between poetry and music provides an interesting case for
computational creativity, yet remains relatively unexplored. This paper
explores the integration of poetry and music through the lens of beat patterns,
investigating whether a byte-based language model can generate words that fit
specific beat patterns within the context of poetry. Drawing on earlier
studies, we developed a method to train a byte-based transformer model, ByT5,
to align poems with beat patterns. The results demonstrate a high level of beat
alignment while maintaining semantic coherence. Future work will aim to improve
the model's ability to create complete beat-aligned poems.
comment: 5 pages, 3 figures, accepted for the 15th International Conference on
Computational Creativity, ICCC'24
☆ IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce
Wenxuan Ding, Weiqi Wang, Sze Heng Douglas Kwok, Minghao Liu, Tianqing Fang, Jiaxin Bai, Junxian He, Yangqiu Song
Enhancing Language Models' (LMs) ability to understand purchase intentions in
E-commerce scenarios is crucial for their effective assistance in various
downstream tasks. However, previous approaches that distill intentions from LMs
often fail to generate meaningful and human-centric intentions applicable in
real-world E-commerce contexts. This raises concerns about the true
comprehension and utilization of purchase intentions by LMs. In this paper, we
present IntentionQA, a double-task multiple-choice question answering benchmark
to evaluate LMs' comprehension of purchase intentions in E-commerce.
Specifically, LMs are tasked to infer intentions based on purchased products
and utilize them to predict additional purchases. IntentionQA consists of 4,360
carefully curated problems across three difficulty levels, constructed using an
automated pipeline to ensure scalability on large E-commerce platforms. Human
evaluations demonstrate the high quality and low false-negative rate of our
benchmark. Extensive experiments across 19 language models show that they still
struggle with certain scenarios, such as accurately understanding products and
intentions or jointly reasoning over both, where they fall far behind human
performance. Our code and data are publicly
available at https://github.com/HKUST-KnowComp/IntentionQA.
☆ Datasets for Multilingual Answer Sentence Selection
Answer Sentence Selection (AS2) is a critical task for designing effective
retrieval-based Question Answering (QA) systems. Most advancements in AS2 focus
on English due to the scarcity of annotated datasets for other languages. This
lack of resources prevents the training of effective AS2 models in different
languages, creating a performance gap between QA systems in English and other
locales. In this paper, we introduce new high-quality datasets for AS2 in five
European languages (French, German, Italian, Portuguese, and Spanish), obtained
through supervised Automatic Machine Translation (AMT) of existing English AS2
datasets such as ASNQ, WikiQA, and TREC-QA using a Large Language Model (LLM).
We evaluated our approach and the quality of the translated datasets through
multiple experiments with different Transformer architectures. The results
indicate that our datasets are pivotal in producing robust and powerful
multilingual AS2 models, significantly contributing to closing the performance
gap between English and other languages.
☆ Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
In reinforcement learning, specification gaming occurs when AI systems learn
undesired behaviors that are highly rewarded due to misspecified training
goals. Specification gaming can range from simple behaviors like sycophancy to
sophisticated and pernicious behaviors like reward-tampering, where a model
directly modifies its own reward mechanism. However, these more pernicious
behaviors may be too complex to be discovered via exploration. In this paper,
we study whether Large Language Model (LLM) assistants which find easily
discovered forms of specification gaming will generalize to perform rarer and
more blatant forms, up to and including reward-tampering. We construct a
curriculum of increasingly sophisticated gameable environments and find that
training on early-curriculum environments leads to more specification gaming on
remaining environments. Strikingly, a small but non-negligible proportion of
the time, LLM assistants trained on the full curriculum generalize zero-shot to
directly rewriting their own reward function. Retraining an LLM not to game
early-curriculum environments mitigates, but does not eliminate,
reward-tampering in later environments. Moreover, adding harmlessness training
to our gameable environments does not prevent reward-tampering. These results
demonstrate that LLMs can generalize from common forms of specification gaming
to more pernicious reward tampering and that such behavior may be nontrivial to
remove.
☆ BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
In recent years, the input context sizes of large language models (LLMs) have
increased dramatically. However, existing evaluation methods have not kept
pace, failing to comprehensively assess the efficiency of models in handling
long contexts. To bridge this gap, we introduce the BABILong benchmark,
designed to test language models' ability to reason across facts distributed in
extremely long documents. BABILong includes a diverse set of 20 reasoning
tasks, including fact chaining, simple induction, deduction, counting, and
handling lists/sets. These tasks are challenging on their own, and even more
demanding when the required facts are scattered across long natural text. Our
evaluations show that popular LLMs effectively utilize only 10-20\% of the
context and their performance declines sharply with increased reasoning
complexity. Among alternatives to in-context reasoning, Retrieval-Augmented
Generation methods achieve a modest 60\% accuracy on single-fact question
answering, independent of context length. Among context extension methods, the
highest performance is demonstrated by recurrent memory transformers, enabling
the processing of lengths up to 11 million tokens. The BABILong benchmark is
extendable to any length to support the evaluation of new upcoming models with
increased capabilities, and we provide splits up to 1 million token lengths.
☆ Evaluation of Large Language Models: STEM education and Gender Stereotypes
Smilla Due, Sneha Das, Marianne Andersen, Berta Plandolit López, Sniff Andersen Nexø, Line Clemmensen
Large Language Models (LLMs) have an increasing impact on our lives with use
cases such as chatbots, study support, coding support, ideation, writing
assistance, and more. Previous studies have revealed linguistic biases in
pronouns used to describe professions or adjectives used to describe men vs
women. These issues have to some degree been addressed in updated LLM versions,
at least to pass existing tests. However, biases may still be present in the
models, and repeated use of gender stereotypical language may reinforce the
underlying assumptions and are therefore important to examine further. This
paper investigates gender biases in LLMs in relation to educational choices
through an open-ended, true-to-use-case experimental design and a quantitative
analysis. We investigate the biases in the context of four different cultures,
languages, and educational systems (English/US/UK, Danish/DK, Catalan/ES, and
Hindi/IN) for ages ranging from 10 to 16 years, corresponding to important
educational transition points in the different countries. We find that there
are significant and large differences in the ratio of STEM to non-STEM
suggested education paths provided by ChatGPT when using typical girl vs boy
names to prompt lists of suggested things to become. There are generally fewer
STEM suggestions in the Danish, Spanish, and Indian contexts than in the
English one. We also find subtle differences in the suggested professions, which we
categorise and report.
☆ The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models
Pre-trained Language models (PLMs) have been acknowledged to contain harmful
information, such as social biases, which may cause negative social impacts or
even bring catastrophic results in application. Previous works on this problem
mainly focused on using black-box methods such as probing to detect and
quantify social biases in PLMs by observing model outputs. As a result,
previous debiasing methods mainly finetune or even pre-train language models on
newly constructed anti-stereotypical datasets, an approach that is costly. In this
work, we try to unveil the mystery of social bias inside language models by
introducing the concept of {\sc Social Bias Neurons}. Specifically, we propose
{\sc Integrated Gap Gradients (IG$^2$)} to accurately pinpoint units (i.e.,
neurons) in a language model that can be attributed to undesirable behavior,
such as social bias. By formalizing undesirable behavior as a distributional
property of language, we employ sentiment-bearing prompts to elicit classes of
sensitive words (demographics) correlated with such sentiments. Our IG$^2$ thus
attributes the uneven distribution for different demographics to specific
Social Bias Neurons, which track the trail of unwanted behavior inside PLM
units to achieve interpretability. Moreover, derived from our interpretable
technique, {\sc Bias Neuron Suppression (BNS)} is further proposed to mitigate
social biases. By studying BERT, RoBERTa, and their attributable differences
from debiased FairBERTa, IG$^2$ allows us to locate and suppress identified
neurons, and further mitigate undesired behaviors. As measured by prior metrics
from StereoSet, our model achieves a higher degree of fairness while
maintaining language modeling ability with low cost.
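As a loose numeric analogue of the attribution step: integrated gradients on the *gap* between two class scores satisfies completeness, i.e., the per-unit attributions sum to the total gap. The sketch below is a toy on a quadratic score over raw inputs, not the paper's IG$^2$, which attributes logit gaps between demographics to neurons inside a PLM; every name here is illustrative.

```python
def score(w, x):
    """Toy quadratic 'logit': f(x) = sum_i w_i * x_i**2."""
    return sum(wi * xi * xi for wi, xi in zip(w, x))

def integrated_gap_gradients(x, baseline, w_a, w_b, steps=200):
    """Attribute the gap f_a(x) - f_b(x) to each unit via a path integral."""
    attrs = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps  # midpoint rule along the straight path
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(len(x)):
            # gradient of the gap w.r.t. unit i at this path point
            g = 2 * (w_a[i] - w_b[i]) * point[i]
            attrs[i] += (x[i] - baseline[i]) * g / steps
    return attrs

w_a, w_b = [1.0, 0.5], [0.2, 0.9]   # scores for two demographic classes
x, baseline = [1.0, 2.0], [0.0, 0.0]
attrs = integrated_gap_gradients(x, baseline, w_a, w_b)
# completeness: attributions sum to the total gap
gap = score(w_a, x) - score(w_b, x)
assert abs(sum(attrs) - gap) < 1e-6
```

Units with large (here, exactly computable) gap attributions are the analogue of Social Bias Neurons; suppressing them is the analogue of BNS.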
☆ SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V. Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus Hudi, Railey Montalan, Ryan Ignatius, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Damanhuri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V. Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Ngee Chia Tai, Ayu Purwarianti, Sebastian Ruder, William Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng-Xin Yong, Samuel Cahyawijaya
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural
variety, with over 1,300 indigenous languages and a population of 671 million
people. However, prevailing AI models suffer from a significant lack of
representation of texts, images, and audio datasets from SEA, compromising the
quality of AI models for SEA languages. Evaluating models for SEA languages is
challenging due to the scarcity of high-quality datasets, compounded by the
dominance of English training data, raising concerns about potential cultural
misrepresentation. To address these challenges, we introduce SEACrowd, a
collaborative initiative that consolidates a comprehensive resource hub,
filling the resource gap by providing standardized corpora in nearly 1,000 SEA
languages across three modalities. Through our SEACrowd benchmarks, we assess
the quality of AI models on 36 indigenous languages across 13 tasks, offering
valuable insights into the current AI landscape in SEA. Furthermore, we propose
strategies to facilitate greater AI advancements, maximizing potential utility
and resource equity for the future of AI in SEA.
comment: https://github.com/SEACrowd
☆ Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning
Large language models (LLMs) have demonstrated remarkable capabilities across
various tasks but still face challenges such as hallucinations. One potential
reason for hallucinations is the lack of relevant knowledge or context. Thus, a
promising solution to mitigate this issue involves instructing LLMs to respond
with "I do not know" when a question falls outside their knowledge domain or
the provided context. However, in this work, we observed that LLMs struggle to
admit their lack of knowledge, primarily due to existing instruction datasets
designed to encourage specific answers. To improve large language models'
capability to recognize the boundaries of their knowledge, we propose a novel
approach called uncertainty-sensitive tuning. This method involves two-stage
training designed for uncertainty recognition and prompt-sensitive activation.
In the first stage, we guide the LLM to reject unknown questions. In the second
stage, we recover the decreased performance in QA tasks by incorporating
designed causal instructions. By leveraging this method, we aim to enhance the
model's ability to identify areas of uncertainty. The experimental results
demonstrate that our proposed uncertainty-sensitive tuning method significantly
improves the performance of the Llama2-chat-7B model. Specifically, it achieves
a substantial 34.7% improvement in handling questions involving knowledge gaps
compared to the original model. Moreover, our approach outperforms GPT-4,
exhibiting a 9.4% increase in overall performance. We open-source the model and
code on GitHub.
☆ Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation
Assessing the performance of interpreting services is a complex task, given
the nuanced nature of spoken language translation, the strategies that
interpreters apply, and the diverse expectations of users. The complexity of
this task becomes even more pronounced when automated evaluation methods are
applied. This is particularly true because interpreted texts exhibit less
linearity between the source and target languages due to the strategies
employed by the interpreter.
This study aims to assess the reliability of automatic metrics in evaluating
simultaneous interpretations by analyzing their correlation with human
evaluations. We focus on a particular feature of interpretation quality, namely
translation accuracy or faithfulness. As a benchmark we use human assessments
performed by language experts, and evaluate how well sentence embeddings and
Large Language Models correlate with them. We quantify semantic similarity
between the source and translated texts without relying on a reference
translation. The results suggest GPT models, particularly GPT-3.5 with direct
prompting, demonstrate the strongest correlation with human judgment in terms
of semantic similarity between source and target texts, even when evaluating
short textual segments. Additionally, the study reveals that the size of the
context window has a notable impact on this correlation.
comment: Paper accepted at the European Association for Machine Translation
conference 2024
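The core measurement -- scoring a translation against its source without a reference translation -- can be sketched with any shared vector space. The toy below substitutes bag-of-words vectors for the sentence embeddings used in the study (a real setup would need a multilingual sentence encoder so that source and target languages share one space); `reference_free_similarity` is an invented name.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reference_free_similarity(source, translation):
    """Score a translation against its source only -- no reference needed.

    Stand-in for the embedding models in the study: both sides are mapped
    to vectors and compared with cosine similarity.
    """
    return cosine(Counter(source.lower().split()),
                  Counter(translation.lower().split()))

# toy check: a faithful rendering scores higher than an unrelated one
src = "the speaker thanked the delegates"
good = "the speaker thanked the delegates warmly"
bad = "ticket sales closed at noon"
assert reference_free_similarity(src, good) > reference_free_similarity(src, bad)
```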
☆ Discovering influential text using convolutional neural networks ACL 2024
Experimental methods for estimating the impacts of text on human evaluation
have been widely used in the social sciences. However, researchers in
experimental settings are usually limited to testing a small number of
pre-specified text treatments. While efforts to mine unstructured texts for
features that causally affect outcomes have been ongoing in recent years, these
models have primarily focused on the topics or specific words of text, which
may not always be the mechanism of the effect. We connect these efforts with
NLP interpretability techniques and present a method for flexibly discovering
clusters of similar text phrases that are predictive of human reactions to
texts using convolutional neural networks. When used in an experimental
setting, this method can identify text treatments and their effects under
certain assumptions. We apply the method to two datasets. The first enables
direct validation of the model's ability to detect phrases known to cause the
outcome. The second demonstrates its ability to flexibly discover text
treatments with varying textual structures. In both cases, the model learns a
greater variety of text treatments compared to benchmark methods, and these
text features quantitatively meet or exceed the ability of benchmark methods to
predict the outcome.
comment: To be published in ACL 2024 Findings
☆ Enhancing Question Answering on Charts Through Effective Pre-training Tasks
To completely understand a document, the use of textual information is not
enough. Understanding visual cues, such as layouts and charts, is also
required. While the current state-of-the-art approaches for document
understanding (both OCR-based and OCR-free) work well, a thorough analysis of
their capabilities and limitations has not yet been performed. Therefore, in
this work, we address the limitations of current VisualQA models when applied
to charts and plots. To investigate shortcomings of the state-of-the-art
models, we conduct a comprehensive behavioral analysis, using ChartQA as a case
study. Our findings indicate that existing models particularly underperform in
answering questions related to the chart's structural and visual context, as
well as numerical information. To address these issues, we propose three simple
pre-training tasks that strengthen the existing model's structural-visual
knowledge as well as its understanding of numerical
questions. We evaluate our pre-trained model (called MatCha-v2) on three chart
datasets - both extractive and abstractive question datasets - and observe that
it achieves an average improvement of 1.7% over the baseline model.
☆ On the Evaluation of Speech Foundation Models for Spoken Language Understanding ACL
Siddhant Arora, Ankita Pasad, Chung-Ming Chien, Jionghao Han, Roshan Sharma, Jee-weon Jung, Hira Dhamyal, William Chen, Suwon Shon, Hung-yi Lee, Karen Livescu, Shinji Watanabe
The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks
was recently introduced to address the need for open resources and benchmarking
of complex spoken language understanding (SLU) tasks, including both
classification and sequence generation tasks, on natural speech. The benchmark
has demonstrated preliminary success in using pre-trained speech foundation
models (SFM) for these SLU tasks. However, the community still lacks a
fine-grained understanding of the comparative utility of different SFMs.
Inspired by this, we ask: which SFMs offer the most benefits for these complex
SLU tasks, and what is the most effective approach for incorporating these
SFMs? To answer this, we perform an extensive evaluation of multiple supervised
and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs
with a lightweight prediction head, (ii) frozen SFMs with a complex prediction
head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although
the supervised SFMs are pre-trained on much more speech recognition data (with
labels), they do not always outperform self-supervised SFMs; the latter tend to
perform at least as well as, and sometimes better than, supervised SFMs,
especially on the sequence generation tasks in SLUE. While there is no
universally optimal way of incorporating SFMs, the complex prediction head
gives the best performance for most tasks, although it increases the inference
time. We also introduce an open-source toolkit and performance leaderboard,
SLUE-PERB, for these tasks and modeling strategies.
comment: Accepted at ACL Findings 2024
☆ Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content
Transition Relevance Places are defined as the end of an utterance where the
interlocutor may take the floor without interrupting the current speaker
--i.e., a place where the turn is terminal. Analyzing turn terminality is
useful for studying the dynamics of turn-taking in spontaneous conversations. This
paper presents an automatic classification of spoken utterances as Terminal or
Non-Terminal in multi-speaker settings. We compared audio, text, and fusions of
both approaches on a French corpus of TV and Radio extracts annotated with
turn-terminality information at each speaker change. Our models are based on
pre-trained self-supervised representations. We report results for different
fusion strategies and varying context sizes. This study also questions the
problem of performance variability by analyzing the differences in results for
multiple training runs with random initialization. The measured accuracy would
allow the use of these models for large-scale analysis of turn-taking.
comment: keywords: Spoken interaction, Media, TV, Radio, Transition-Relevance
Places, Turn Taking, Interruption. Accepted to InterSpeech 2024, Kos Island,
Greece
☆ Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection INTERSPEECH 2024
As a robust and large-scale multilingual speech recognition model, Whisper
has demonstrated impressive results in many low-resource and
out-of-distribution scenarios. However, its encoder-decoder structure hinders
its application to streaming speech recognition. In this paper, we introduce
Simul-Whisper, which uses the time alignment embedded in Whisper's
cross-attention to guide auto-regressive decoding and achieve chunk-based
streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we
observe the negative effect of the truncated words at the chunk boundaries on
the decoding results and propose an integrate-and-fire-based truncation
detection model to address this issue. Experiments on multiple languages and
Whisper architectures show that Simul-Whisper achieves an average absolute word
error rate degradation of only 1.46% at a chunk size of 1 second, which
significantly outperforms the current state-of-the-art baseline.
comment: Accepted by INTERSPEECH 2024
☆ FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain
This paper describes the inference system of FZI-WIM at the SemEval-2024 Task
2: Safe Biomedical Natural Language Inference for Clinical Trials. Our system
utilizes the chain of thought (CoT) paradigm to tackle this complex reasoning
problem and further improves the CoT performance with self-consistency. Instead
of greedy decoding, we sample multiple reasoning chains with the same prompt
and make the final verification with majority voting. The self-consistent CoT
system achieves a baseline F1 score of 0.80 (1st), faithfulness score of 0.90
(3rd), and consistency score of 0.73 (12th). We release the code and data
publicly at https://github.com/jens5588/FZI-WIM-NLI4CT.
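The self-consistency step the abstract describes, sampling several reasoning chains from the same prompt and taking a majority vote over their verdicts, can be sketched as follows (the chain structure and label names are illustrative, not taken from the system's code):

```python
from collections import Counter

def self_consistent_verdict(chains):
    """Majority vote over the final labels of sampled reasoning chains."""
    labels = [c["verdict"] for c in chains]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical chains sampled with temperature (instead of greedy decoding).
chains = [
    {"reasoning": "...", "verdict": "Entailment"},
    {"reasoning": "...", "verdict": "Contradiction"},
    {"reasoning": "...", "verdict": "Entailment"},
]
print(self_consistent_verdict(chains))  # Entailment
```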
☆ Deep Bayesian Active Learning for Preference Modeling in Large Language Models
Leveraging human preferences for steering the behavior of Large Language
Models (LLMs) has demonstrated notable success in recent years. Nonetheless,
data selection and labeling are still a bottleneck for these systems,
particularly at large scale. Hence, selecting the most informative points for
acquiring human feedback may considerably reduce the cost of preference
labeling and unleash the further development of LLMs. Bayesian Active Learning
provides a principled framework for addressing this challenge and has
demonstrated remarkable success in diverse settings. However, previous attempts
to employ it for Preference Modeling did not meet such expectations. In this
work, we identify that naive epistemic uncertainty estimation leads to the
acquisition of redundant samples. We address this by proposing the Bayesian
Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition
policy that not only targets points of high epistemic uncertainty according to
the preference model but also seeks to maximize the entropy of the acquired
prompt distribution in the feature space spanned by the employed LLM. Notably,
our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference
labels in two popular human preference datasets and exceeds previous stochastic
Bayesian acquisition policies.
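A minimal sketch of such a stochastic acquisition rule combines per-point epistemic uncertainty with a diversity term over the already-acquired prompt embeddings. The nearest-neighbor distance below is a crude stand-in for the paper's entropy estimator, and all names and the weighting are assumptions:

```python
import numpy as np

def bal_pm_scores(epistemic, feats, acquired, beta=1.0):
    """Score candidate prompts: epistemic uncertainty plus a diversity bonus.

    epistemic: (n,) uncertainty of the preference model per candidate
    feats:     (n, d) candidate prompt embeddings from the LLM feature space
    acquired:  (m, d) embeddings of prompts already acquired
    """
    if len(acquired) == 0:
        diversity = np.ones(len(feats))
    else:
        # Distance to the nearest acquired prompt: a crude proxy for the
        # entropy gained by adding the candidate to the acquired set.
        d = np.linalg.norm(feats[:, None, :] - acquired[None, :, :], axis=-1)
        diversity = d.min(axis=1)
    return epistemic + beta * diversity
```

Under this proxy, a candidate far from everything already labeled outscores an equally uncertain candidate that duplicates an acquired prompt, which is exactly the redundancy the abstract says naive uncertainty estimation fails to avoid.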
☆ Group and Shuffle: Efficient Structured Orthogonal Parametrization
The increasing size of neural networks has led to a growing demand for
methods of efficient fine-tuning. Recently, an orthogonal fine-tuning paradigm
was introduced that uses orthogonal matrices for adapting the weights of a
pretrained model. In this paper, we introduce a new class of structured
matrices, which unifies and generalizes structured classes from previous works.
We examine properties of this class and build a structured orthogonal
parametrization upon it. We then use this parametrization to modify the
orthogonal fine-tuning framework, improving parameter and computational
efficiency. We empirically validate our method on different domains, including
adapting text-to-image diffusion models and downstream task fine-tuning in
language modeling. Additionally, we adapt our construction for orthogonal
convolutions and conduct experiments with 1-Lipschitz neural networks.
☆ Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models
In the realm of multimodal tasks, Visual Question Answering (VQA) plays a
crucial role by addressing natural language questions grounded in visual
content. Knowledge-Based Visual Question Answering (KBVQA) advances this
concept by adding external knowledge along with images to respond to questions.
We introduce an approach for KBVQA, augmenting the existing vision-language
transformer encoder-decoder (OFA) model. Our main contribution involves
enhancing questions by incorporating relevant external knowledge extracted from
knowledge graphs, using a dynamic triple extraction method. We supply a
flexible number of triples from the knowledge graph as context, tailored to
meet the requirements for answering the question. Our model, enriched with
knowledge, demonstrates an average improvement of 4.75% in Exact Match Score
over the state-of-the-art on three different KBVQA datasets. Through
experiments and analysis, we demonstrate that furnishing variable triples for
each question improves the reasoning capabilities of the language model in
contrast to supplying a fixed number of triples. This is illustrated even for
recent large language models. Additionally, we highlight the model's
generalization capability by showcasing its SOTA-beating performance on a small
dataset, achieved through straightforward fine-tuning.
comment: 16 pages, 12 figures
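The "flexible number of triples" idea can be illustrated with a thresholded top-k selection; the function name, scoring, and cut-offs below are hypothetical, not the paper's actual dynamic triple extraction method:

```python
def select_triples(scored_triples, max_k=10, min_score=0.5):
    """Supply a variable number of knowledge-graph triples per question:
    keep every triple above a relevance threshold, capped at max_k."""
    ranked = sorted(scored_triples, key=lambda t: t[1], reverse=True)
    return [triple for triple, score in ranked[:max_k] if score >= min_score]

# Hypothetical (triple, question-relevance) pairs.
triples = [(("Paris", "capitalOf", "France"), 0.9),
           (("Paris", "locatedIn", "Europe"), 0.6),
           (("Paris", "namedAfter", "Paris (myth)"), 0.2)]
print(select_triples(triples))  # keeps only the two relevant triples
```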
☆ Details Make a Difference: Object State-Sensitive Neurorobotic Task Planning
The state of an object reflects its current status or condition and is
important for a robot's task planning and manipulation. However, detecting an
object's state and generating a state-sensitive plan for robots is challenging.
Recently, pre-trained Large Language Models (LLMs) and Vision-Language Models
(VLMs) have shown impressive capabilities in generating plans. However, to the
best of our knowledge, there is hardly any investigation on whether LLMs or
VLMs can also generate object state-sensitive plans. To study this, we
introduce an Object State-Sensitive Agent (OSSA), a task-planning agent
empowered by pre-trained neural networks. We propose two methods for OSSA: (i)
a modular model consisting of a pre-trained vision processing module (dense
captioning model, DCM) and a natural language processing model (LLM), and (ii)
a monolithic model consisting only of a VLM. To quantitatively evaluate the
performances of the two methods, we use tabletop scenarios where the task is to
clear the table. We contribute a multimodal benchmark dataset that takes object
states into consideration. Our results show that both methods can be used for
object state-sensitive tasks, but the monolithic approach outperforms the
modular approach. The code for OSSA is available at
https://github.com/Xiao-wen-Sun/OSSA.
☆ HIRO: Hierarchical Information Retrieval Optimization
Large Language Models (LLMs) excel in natural language tasks but face
limitations due to static training datasets, resulting in outdated or
contextually shallow responses. Retrieval-Augmented Generation (RAG) addresses
this by integrating real-time external knowledge, enhancing model accuracy and
credibility, especially for knowledge-intensive tasks. However, RAG-enhanced
LLMs struggle with long contexts, causing them to "choke" on information
overload, compromising response quality. Recent RAG applications use
hierarchical data structures for storing documents, organized at various levels
of summarization and information density. In this context, we introduce HIRO
(Hierarchical Information Retrieval Optimization), a novel querying approach
for RAG applications using hierarchical structures for storing documents. HIRO
employs DFS-based recursive similarity score calculation and branch pruning to
minimize the context returned to the LLM without informational loss. HIRO
outperforms existing querying mechanisms on the NarrativeQA dataset by an
absolute performance gain of 10.85%.
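The DFS-with-pruning idea reads roughly as follows; the tree layout, cosine scoring, and threshold are illustrative assumptions, since the abstract does not give the exact pruning rule:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hiro_retrieve(node, query_emb, threshold=0.3):
    """DFS over a hierarchical summary tree; prune branches whose summary
    embedding scores below the threshold; return surviving leaf chunks."""
    if cosine(node["emb"], query_emb) < threshold:
        return []  # whole branch pruned: never reaches the LLM context
    if not node.get("children"):
        return [node["text"]]
    hits = []
    for child in node["children"]:
        hits.extend(hiro_retrieve(child, query_emb, threshold))
    return hits
```

Pruning at a summary node discards all documents beneath it at once, which is how the context handed to the LLM shrinks without scoring every leaf.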
☆ Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness ACL 2024
Dialects introduce syntactic and lexical variations in language that occur in
regional or social groups. Most NLP methods are not sensitive to such
variations. This may lead to unfair behavior of the methods, conveying negative
bias towards dialect speakers. While previous work has studied dialect-related
fairness for aspects like hate speech, other aspects of biased language, such
as lewdness, remain fully unexplored. To fill this gap, we investigate
performance disparities between dialects in the detection of five aspects of
biased language and how to mitigate them. To alleviate bias, we present a
multitask learning approach that models dialect language as an auxiliary task
to incorporate syntactic and lexical variations. In our experiments with
African-American English dialect, we provide empirical evidence that
complementing common learning approaches with dialect modeling improves their
fairness. Furthermore, the results suggest that multitask learning achieves
state-of-the-art performance and helps to detect properties of biased language
more reliably.
comment: Accepted to Findings of the Association for Computational
Linguistics: ACL 2024
☆ A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
This research investigates prompt designs for evaluating generated texts using
large language models (LLMs). While LLMs are increasingly used for scoring
various inputs, creating effective prompts for open-ended text evaluation
remains challenging due to model sensitivity and subjectivity in evaluation of
text generation. Our study experimented with different prompt structures,
altering the sequence of output instructions and including explanatory reasons.
We found that the order in which reasons and scores are presented significantly
influences LLMs' scoring, depending on the level of rule understanding conveyed
in the prompt. Additional optimization may further enhance scoring alignment
when sufficient data is available. This insight is crucial for improving the accuracy and
consistency of LLM-based evaluations.
comment: Presented in JSAI 2024. The first two authors contributed equally.
arXiv admin note: substantial text overlap with arXiv:2406.02863
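The manipulated variable, the sequence of output instructions, can be made concrete with two hypothetical prompt templates; the wording below is illustrative, not the study's actual prompts:

```python
# Two orderings of the same evaluation prompt: the study's finding is that
# which element the model must produce first changes the scores it gives.
REASON_FIRST = (
    "Evaluate the generated text below.\n"
    "First explain your reasons, then give a score from 1 to 5.\n"
    "Text: {text}\n"
)
SCORE_FIRST = (
    "Evaluate the generated text below.\n"
    "First give a score from 1 to 5, then explain your reasons.\n"
    "Text: {text}\n"
)

print(REASON_FIRST.format(text="..."))
```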
☆ Bag of Lies: Robustness in Continuous Pre-training BERT
This study aims to acquire more insights into the continuous pre-training
phase of BERT regarding entity knowledge, using the COVID-19 pandemic as a case
study. Since the pandemic emerged after the last update of BERT's pre-training
data, the model has little to no entity knowledge about COVID-19. Using
continuous pre-training, we control what entity knowledge is available to the
model. We compare the baseline BERT model with the further pre-trained variants
on the fact-checking benchmark Check-COVID. To test the robustness of
continuous pre-training, we experiment with several adversarial methods to
manipulate the input data, such as training on misinformation and shuffling the
word order until the input becomes nonsensical. Surprisingly, our findings
reveal that these methods do not degrade, and sometimes even improve, the
model's downstream performance. This suggests that continuous pre-training of
BERT is robust against misinformation. Furthermore, we are releasing a new
dataset, consisting of original texts from academic publications in the
LitCovid repository and their AI-generated false counterparts.
☆ ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Chufan Shi, Cheng Yang, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang
We introduce a new benchmark, ChartMimic, aimed at assessing the
visually-grounded code generation capabilities of large multimodal models
(LMMs). ChartMimic utilizes information-intensive visual charts and textual
instructions as inputs, requiring LMMs to generate the corresponding code for
chart rendering. ChartMimic includes 1,000 human-curated (figure, instruction,
code) triplets, which represent the authentic chart use cases found in
scientific papers across various domains (e.g., Physics, Computer Science,
and Economics). These charts span 18 regular types and 4 advanced types,
diversifying into 191 subcategories. Furthermore, we propose multi-level
evaluation metrics to provide an automatic and thorough assessment of the
output code and the rendered charts. Unlike existing code generation
benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to
harmonize a blend of cognitive capabilities, encompassing visual understanding,
code generation, and cross-modal reasoning. The evaluation of 3 proprietary
models and 11 open-weight models highlights the substantial challenges posed by
ChartMimic. Even the advanced GPT-4V and Claude-3-opus achieve average
scores of only 73.2 and 53.7, respectively, indicating significant room for
improvement. We anticipate that ChartMimic will inspire the development of
LMMs, advancing the pursuit of artificial general intelligence.
comment: Data and code are available at
https://github.com/ChartMimic/ChartMimic
☆ BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
Existing Vision-Language Compositionality (VLC) benchmarks like SugarCrepe
are formulated as image-to-text retrieval problems, where, given an image, the
models need to select between the correct textual description and a synthetic
hard negative text. In this work we present the Bidirectional Vision-Language
Compositionality (BiVLC) dataset. The novelty of BiVLC is to add a synthetic
hard negative image generated from the synthetic text, resulting in two
image-to-text retrieval examples (one for each image) and, more importantly,
two text-to-image retrieval examples (one for each text). Human annotators
filter out ill-formed examples ensuring the validity of the benchmark. The
experiments on BiVLC uncover a weakness of current multimodal models, as they
perform poorly in the text-to-image direction. In fact, when considering both
retrieval directions, the conclusions obtained in previous works change
significantly. In addition to the benchmark, we show that a contrastive model
trained using synthetic images and texts improves the state of the art in
SugarCrepe and in BiVLC for both retrieval directions. The gap to human
performance in BiVLC confirms that Vision-Language Compositionality is still a
challenging problem. BiVLC and code are available at
https://imirandam.github.io/BiVLC_project_page.
☆ An efficient text augmentation approach for contextualized Mandarin speech recognition
Although contextualized automatic speech recognition (ASR) systems are
commonly used to improve the recognition of uncommon words, their effectiveness
is hindered by the inherent limitations of speech-text data availability. To
address this challenge, our study proposes to leverage extensive text-only
datasets and contextualize pre-trained ASR models using a straightforward
text-augmentation (TA) technique, all while keeping computational costs
minimal. In particular, to contextualize a pre-trained CIF-based ASR, we
construct a codebook using limited speech-text data. By utilizing a simple
codebook lookup process, we convert available text-only data into latent text
embeddings. These embeddings then enhance the inputs for the contextualized
ASR. Our experiments on diverse Mandarin test sets demonstrate that our TA
approach significantly boosts recognition performance. The top-performing
system shows relative CER improvements of up to 30% on rare words and 15%
across all words in general.
comment: accepted to Interspeech 2024
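The codebook-lookup step, converting text-only data into latent text embeddings, can be sketched as a vector-quantization-style nearest-entry lookup. Shapes and names are illustrative, since the abstract does not specify the CIF codebook's interface:

```python
import numpy as np

def codebook_lookup(text_embs, codebook):
    """Map each token embedding to its nearest codebook entry, yielding
    latent embeddings that mimic those learned from paired speech-text data.

    text_embs: (seq_len, d) embeddings of text-only training data
    codebook:  (codebook_size, d) entries built from limited paired data
    """
    dists = np.linalg.norm(text_embs[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook[dists.argmin(axis=1)]  # (seq_len, d)
```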
☆ BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh
Large language models (LLMs) often lack culture-specific knowledge of daily
life, especially across diverse regions and non-English languages. Existing
benchmarks for evaluating LLMs' cultural sensitivities are limited to a single
language or collected from online sources such as Wikipedia, which do not
reflect the mundane everyday lifestyles of diverse regions. That is,
information about the food people eat for their birthday celebrations, spices
they typically use, musical instruments youngsters play, or the sports they
practice in school is common cultural knowledge but uncommon in easily
collected online sources, especially for underrepresented cultures. To address
this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate
LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises
52.6k question-answer pairs from 16 countries/regions, in 13 different
languages, including low-resource ones such as Amharic, Assamese, Azerbaijani,
Hausa, and Sundanese. We construct the benchmark to include two formats of
questions: short-answer and multiple-choice. We show that LLMs perform better
for cultures that are highly represented online, with a performance gap of up
to 57.34% for GPT-4, the best-performing model, in the short-answer format. For
cultures represented by mid-to-high-resource languages, LLMs perform better in
their local languages, but for cultures represented by low-resource languages,
LLMs perform better in English than in the local languages. We make our dataset
publicly available at: https://github.com/nlee0212/BLEnD.
☆ Experiments in News Bias Detection with Pre-Trained Neural Transformers
The World Wide Web provides unrivalled access to information globally,
including factual news reporting and commentary. However, state actors and
commercial players increasingly spread biased (distorted) or fake (non-factual)
information to promote their agendas. We compare several large, pre-trained
language models on the task of sentence-level news bias detection and sub-type
classification, providing quantitative and qualitative results.
☆ CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions
The integration of Artificial Intelligence (AI), especially Large Language
Models (LLMs), into the clinical diagnosis process offers significant potential
to improve the efficiency and accessibility of medical care. While LLMs have
shown some promise in the medical domain, their application in clinical
diagnosis remains underexplored, especially in real-world clinical practice,
where highly sophisticated, patient-specific decisions need to be made. Current
evaluations of LLMs in this field are often narrow in scope, focusing on
specific diseases or specialties and employing simplified diagnostic tasks. To
bridge this gap, we introduce CliBench, a novel benchmark developed from the
MIMIC IV dataset, offering a comprehensive and realistic assessment of LLMs'
capabilities in clinical diagnosis. This benchmark not only covers diagnoses
from a diverse range of medical cases across various specialties but also
incorporates tasks of clinical significance: treatment procedure
identification, lab test ordering and medication prescriptions. Supported by
structured output ontologies, CliBench enables a precise and multi-granular
evaluation, offering an in-depth understanding of LLMs' capabilities on diverse
clinical tasks of desired granularity. We conduct a zero-shot evaluation of
leading LLMs to assess their proficiency in clinical decision-making. Our
preliminary results shed light on the potential and limitations of current LLMs
in clinical settings, providing valuable insights for future advancements in
LLM-powered healthcare.
comment: Project page: https://clibench.github.io
☆ Knowledge Editing in Language Models via Adapted Direct Preference Optimization
Large Language Models (LLMs) can become outdated over time as they may lack
updated world knowledge, leading to factual knowledge errors and gaps.
Knowledge Editing (KE) aims to overcome this challenge using weight updates
that do not require expensive retraining. We propose treating KE as an LLM
alignment problem. Toward this goal, we introduce Knowledge Direct Preference
Optimization (KDPO), a variation of the Direct Preference Optimization (DPO)
that is more effective for knowledge modifications. Our method is based on an
online approach that continually updates the knowledge stored in the model. We
use the current knowledge as a negative sample and the new knowledge we want to
introduce as a positive sample in the DPO objective. We also use
teacher-forcing for negative sample generation and optimize using the positive
sample, which helps maintain localized changes. We tested our KE method on
various datasets and models, comparing it to several cutting-edge methods, with
100 and 500 sequential edits. Additionally, we conducted an ablation study
comparing our method to the standard DPO approach. Our experimental results
show that our modified DPO method allows for more refined KE, achieving similar
or better performance compared to previous methods.
comment: 9 pages, 4 figures
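The pairing the abstract describes maps directly onto the standard DPO loss, with the stale fact as the rejected completion and the new fact as the chosen one. The sketch below is the vanilla DPO objective, not the paper's modified KDPO:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Vanilla DPO loss for one pair. In knowledge editing, 'chosen' is the
    new fact and 'rejected' is the model's current (outdated) knowledge."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

At zero margin the loss is log 2; it falls as the policy comes to prefer the new fact over the stale one more strongly than the reference model does.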
☆ GEB-1.3B: Open Lightweight Large Language Model
Recently developed large language models (LLMs) such as ChatGPT, Claude, and
Llama have demonstrated impressive abilities, and even surpass human-level
performance in several tasks. Despite their success, the resource-intensive
demands of these models, requiring significant computational power for both
training and inference, limit their deployment to high-performance servers.
Additionally, the extensive calculation requirements of the models often lead
to increased latency in response times. With the increasing need for LLMs to
operate efficiently on CPUs, research about lightweight models that are
optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a
lightweight LLM trained on 550 billion tokens in both Chinese and English
languages. We employ novel training techniques, including RoPE,
Group-Query-Attention, and FlashAttention-2, to accelerate training while
maintaining model performance. Additionally, we fine-tune the model using 10
million samples of instruction data to enhance alignment. GEB-1.3B exhibits
outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU,
outperforming comparative models such as MindLLM-1.3B and TinyLLaMA-1.1B.
Notably, the FP32 version of GEB-1.3B achieves commendable inference times on
CPUs, with ongoing efforts to further enhance speed through advanced
quantization techniques. The release of GEB-1.3B as an open-source model marks
a significant contribution to the development of lightweight LLMs, promising to
foster further research and innovation in the field.
comment: GEB-1.3B technical report
☆ 3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding
Inspired by the Bloch Sphere representation, we propose a novel rotary
position encoding on a three-dimensional sphere, named 3D Rotary Position
Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary
Position Encoding (RoPE), with two major advantages for modeling long contexts:
controllable long-term decay and improved position resolution. For controllable
long-term decay, 3D-RPE allows for the regulation of long-term decay within the
chunk size, ensuring the modeling of relative positional information between
tokens at a distant relative position. For enhanced position resolution, 3D-RPE
can mitigate the degradation of position resolution caused by position
interpolation on RoPE. We have conducted experiments on long-context Natural
Language Understanding (NLU) and long-sequence Language Modeling (LM) tasks.
From the experimental results, 3D-RPE achieved performance improvements over
RoPE, especially in long-context NLU tasks.
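For reference, the 2D RoPE baseline that 3D-RPE generalizes rotates each channel pair by a position-dependent angle. The sketch below covers only this baseline; the spherical three-dimensional construction is specific to the paper:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Standard 2D RoPE: rotate each (even, odd) channel pair of x by an
    angle that grows with position, so query-key dot products depend on
    relative offsets."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]     # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, position 0 is the identity and vector norms are preserved, which is also why extending the context by position interpolation squeezes the angles and degrades position resolution, the issue 3D-RPE targets.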
☆ A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation ECML-PKDD
Current state-of-the-art dialogue systems heavily rely on extensive training
datasets. However, challenges arise in domains where domain-specific training
datasets are insufficient or entirely absent. To tackle this challenge, we
propose a novel data \textbf{A}ugmentation framework for
\textbf{M}ulti-\textbf{D}omain \textbf{D}ialogue \textbf{G}eneration, referred
to as \textbf{AMD$^2$G}. The AMD$^2$G framework consists of a data augmentation
process and a two-stage training approach: domain-agnostic training and domain
adaptation training. We posit that domain corpora are a blend of
domain-agnostic and domain-specific features, with certain representation
patterns shared among diverse domains. Domain-agnostic training aims to enable
models to learn these common expressive patterns. To construct domain-agnostic
dialogue corpora, we employ a \textit{\textbf{de-domaining}} data processing
technique used to remove domain-specific features. By mitigating the effects of
domain-specific features, the model trained on the de-domained corpora can
effectively learn common expression patterns in different domains.
Subsequently, we adapt the learned domain-agnostic features to the target
domain through domain adaptation training. We conduct experiments on Chinese
dialogue datasets from five different domains and show that AMD$^2$G achieves
superior performance compared to both direct training on the target domain
corpus and collective training on all five domain corpora. Our work underscores
AMD$^2$G as a viable alternative solution for low-resource multi-domain
dialogue generation. Code and data associated with our work are available in
our GitHub repository.
comment: 17 pages, ECML-PKDD
☆ LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data
Multimodal Deep Learning enhances decision-making by integrating diverse
information sources, such as texts, images, audio, and videos. To develop
trustworthy multimodal approaches, it is essential to understand how
uncertainty impacts these models. We introduce LUMA, a unique benchmark
dataset, featuring audio, image, and textual data from 50 classes, for learning
from uncertain and multimodal data. It extends the well-known CIFAR 10/100
dataset with audio samples extracted from three audio corpora, and text data
generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset
enables the controlled injection of varying types and degrees of uncertainty to
achieve and tailor specific experiments and benchmarking initiatives. LUMA is
also available as a Python package that includes functions for generating
multiple variants of the dataset, with control over the diversity of the data,
the amount of noise per modality, and the addition of out-of-distribution samples.
A baseline pre-trained model is also provided alongside three uncertainty
quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable
Conflictive Multi-View Learning. This comprehensive dataset and its tools are
intended to promote and support the development and benchmarking of trustworthy
and robust multimodal deep learning approaches.
☆ On the Encoding of Gender in Transformer-based ASR Representations
While existing literature relies on performance differences to uncover gender
biases in ASR models, a deeper analysis is essential to understand how gender
is encoded and utilized during transcript generation. This work investigates
the encoding and utilization of gender in the latent representations of two
transformer-based ASR models, Wav2Vec2 and HuBERT. Using linear erasure, we
demonstrate the feasibility of removing gender information from each layer of
an ASR model and show that such an intervention has minimal impacts on the ASR
performance. Additionally, our analysis reveals a concentration of gender
information within the first and last frames in the final layers, explaining
the ease of erasing gender in these layers. Our findings suggest the prospect
of creating gender-neutral embeddings that can be integrated into ASR
frameworks without compromising their efficacy.
comment: Accepted at Interspeech 2024
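Linear erasure of a concept can be sketched as projecting representations onto the orthogonal complement of a learned concept subspace. The subspace below is a placeholder; the paper's actual erasure method may differ in how that subspace is found:

```python
import numpy as np

def linear_erase(X, W):
    """Remove the subspace spanned by the columns of W (e.g., a learned
    gender direction) from representations X via orthogonal projection."""
    Q, _ = np.linalg.qr(W)          # orthonormal basis of the subspace
    return X - (X @ Q) @ Q.T
```

After this projection, no linear probe along the erased direction can recover the concept, which is the sense in which the embeddings become "gender-neutral" while the rest of the representation is untouched.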
☆ Rapport-Driven Virtual Agent: Rapport Building Dialogue Strategy for Improving User Experience at First Meeting INTERSPEECH 2024
Rapport is known as a conversational aspect focusing on relationship
building, which influences outcomes in collaborative tasks. This study aims to
establish human-agent rapport through small talk by using a rapport-building
strategy. We implemented this strategy for the virtual agents based on dialogue
strategies by prompting a large language model (LLM). In particular, we
utilized two dialogue strategies, predefined sequence and free-form, to guide the
dialogue generation framework. We conducted analyses based on human
evaluations, examining correlations between total turns, utterance characters,
rapport score, and user experience variables: naturalness, satisfaction,
interest, engagement, and usability. We investigated correlations between
rapport score and naturalness, satisfaction, engagement, and conversation flow.
Our experimental results also indicated that using free-form to prompt the
rapport-building strategy performed the best in subjective scores.
comment: will be presented at INTERSPEECH 2024
☆ Federated Learning driven Large Language Models for Swarm Intelligence: A Survey
Federated learning (FL) offers a compelling framework for training large
language models (LLMs) while addressing data privacy and decentralization
challenges. This paper surveys recent advancements in the federated learning of
large language models, with a particular focus on machine unlearning, a crucial
aspect for complying with privacy regulations like the Right to be Forgotten.
Machine unlearning in the context of federated LLMs involves systematically and
securely removing individual data contributions from the learned model without
retraining from scratch. We explore various strategies that enable effective
unlearning, such as perturbation techniques, model decomposition, and
incremental learning, highlighting their implications for maintaining model
performance and data privacy. Furthermore, we examine case studies and
experimental results from recent literature to assess the effectiveness and
efficiency of these approaches in real-world scenarios. Our survey reveals a
growing interest in developing more robust and scalable federated unlearning
methods, suggesting a vital area for future research in the intersection of AI
ethics and distributed machine learning technologies.
☆ HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
In modern large language models (LLMs), increasing sequence lengths is a
crucial challenge for enhancing their comprehension and coherence in handling
complex tasks such as multi-modal question answering. However, handling long
context sequences with LLMs is prohibitively costly due to the conventional
attention mechanism's quadratic time and space complexity, and the context
window size is limited by the GPU memory. Although recent works have proposed
linear and sparse attention mechanisms to address this issue, their real-world
applicability is often limited by the need to re-train pre-trained models. In
response, we propose a novel approach, Hierarchically Pruned Attention (HiP),
which simultaneously reduces the training and inference time complexity from
$O(T^2)$ to $O(T \log T)$ and the space complexity from $O(T^2)$ to $O(T)$. To
this end, we devise a dynamic sparse attention mechanism that generates an
attention mask through a novel tree-search-like algorithm for a given query on
the fly. HiP is training-free as it only utilizes the pre-trained attention
scores to spot the positions of the top-$k$ most significant elements for each
query. Moreover, it ensures that no token is overlooked, unlike the sliding
window-based sub-quadratic attention methods, such as StreamingLLM. Extensive
experiments on diverse real-world benchmarks demonstrate that HiP significantly
reduces prompt (i.e., prefill) and decoding latency and memory usage while
maintaining high generation performance with little or no degradation. As HiP
allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no
additional engineering thanks to its easy plug-and-play deployment, we believe
that our work will have a large practical impact, opening up many long-context
LLM applications that were previously infeasible.
comment: 26 pages, 15 figures
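The top-k masking idea at the core of HiP can be illustrated with a toy sketch. Note that the hierarchical tree search that makes key selection sub-quadratic is elided here; exact top-k selection stands in for it, and all shapes and values are illustrative, not the paper's implementation:

```python
import numpy as np

def topk_sparse_attention(q, k, v, topk):
    """Toy sketch of top-k sparse attention: each query attends only to its
    topk highest-scoring keys (HiP locates these via a tree-search-like
    algorithm; here we use exact top-k purely for illustration)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])                # (Tq, Tk) logits
    # The k-th largest score per query row becomes the masking threshold.
    kth = np.partition(scores, -topk, axis=-1)[:, -topk:-topk + 1]
    masked = np.where(scores >= kth, scores, -np.inf)      # drop the rest
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over top-k
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(4, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
out = topk_sparse_attention(q, k, v, topk=4)
print(out.shape)  # (4, 8)
```

Each query row ends up with at most a handful of nonzero attention weights, which is what reduces the per-query cost once the selection itself is made cheap.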
☆ Retrieval Augmented Fact Verification by Synthesizing Contrastive Arguments ACL 2024
The rapid propagation of misinformation poses substantial risks to public
interest. To combat misinformation, large language models (LLMs) are adapted to
automatically verify claim credibility. Nevertheless, existing methods rely
heavily on the knowledge embedded within LLMs and/or on black-box APIs for
evidence collection, leading to subpar performance with smaller LLMs or
unreliable context. In this paper, we propose retrieval augmented fact verification
through the synthesis of contrasting arguments (RAFTS). Upon input claims,
RAFTS starts with evidence retrieval, where we design a retrieval pipeline to
collect and re-rank relevant documents from verifiable sources. Then, RAFTS
forms contrastive arguments (i.e., supporting or refuting) conditioned on the
retrieved evidence. In addition, RAFTS leverages an embedding model to identify
informative demonstrations, followed by in-context prompting to generate the
prediction and explanation. Our method effectively retrieves relevant documents
as evidence and evaluates arguments from varying perspectives, incorporating
nuanced information for fine-grained decision-making. Combined with informative
in-context examples as priors, RAFTS achieves significant improvements over
supervised and LLM baselines without complex prompts. We demonstrate the
effectiveness of our method through extensive experiments, where RAFTS can
outperform GPT-based methods with a significantly smaller 7B LLM.
comment: Accepted to ACL 2024
☆ Pcc-tuning: Breaking the Contrastive Learning Ceiling in Semantic Textual Similarity
Semantic Textual Similarity (STS) constitutes a critical research direction
in computational linguistics and serves as a key indicator of the encoding
capabilities of embedding models. Driven by advances in pre-trained language
models and contrastive learning techniques, leading sentence representation
methods can already achieve average Spearman's correlation scores of
approximately 86 across seven STS benchmarks in SentEval. However, further
improvements have become increasingly marginal, with no existing method
attaining an average score higher than 87 on these tasks. This paper conducts
an in-depth analysis of this phenomenon and concludes that the upper limit for
Spearman's correlation scores using contrastive learning is 87.5. To transcend
this ceiling, we propose an innovative approach termed Pcc-tuning, which
employs Pearson's correlation coefficient as a loss function to refine model
performance beyond contrastive learning. Experimental results demonstrate that
Pcc-tuning markedly surpasses previous state-of-the-art strategies, raising the
Spearman's correlation score to above 90.
comment: Work in Progress
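As a sketch of the core idea, a Pearson-based objective can be written as one minus the sample correlation between predicted and gold similarity scores, so that minimizing it pushes predictions toward linear agreement with the gold annotations. This is a minimal illustration; the paper's exact loss formulation and training setup may differ:

```python
import math

def pearson_loss(pred, gold):
    """1 - Pearson correlation between predicted and gold similarity scores
    (illustrative form, not necessarily the paper's exact loss)."""
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    return 1.0 - cov / (sp * sg)

# Perfectly linearly related scores give a loss of (approximately) 0.
print(round(pearson_loss([0.1, 0.4, 0.35, 0.8], [0.2, 0.8, 0.7, 1.6]), 6))
```

Unlike a contrastive objective over positive/negative pairs, this loss consumes graded similarity labels directly, which is the lever the paper uses to move beyond the contrastive-learning ceiling.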
☆ OSPC: Detecting Harmful Memes with Large Language Model as a Catalyst
Memes, which rapidly disseminate personal opinions and positions across the
internet, also pose significant challenges in propagating social bias and
prejudice. This study presents a novel approach to detecting harmful memes,
particularly within the multicultural and multilingual context of Singapore.
Our methodology integrates image captioning, Optical Character Recognition
(OCR), and Large Language Model (LLM) analysis to comprehensively understand
and classify harmful memes. Utilizing the BLIP model for image captioning,
PP-OCR and TrOCR for text recognition across multiple languages, and the Qwen
LLM for nuanced language understanding, our system is capable of identifying
harmful content in memes created in English, Chinese, Malay, and Tamil. To
enhance the system's performance, we fine-tuned our approach by leveraging
additional data labeled using GPT-4V, aiming to distill the understanding
capability of GPT-4V for harmful memes into our system. Our framework ranked
first on the public leaderboard of the Online Safety Prize Challenge hosted by
AI Singapore, with an AUROC of 0.7749 and an accuracy of 0.7087, significantly
ahead of the other teams. Notably, our approach outperforms previous
benchmarks, with FLAVA achieving an AUROC of 0.5695 and VisualBERT an AUROC of
0.5561.
☆ Application of Natural Language Processing in Financial Risk Detection
This paper explores the application of Natural Language Processing (NLP) in
financial risk detection. By constructing an NLP-based financial risk detection
model, this study aims to identify and predict potential risks in financial
documents and communications. First, the fundamental concepts of NLP and its
theoretical foundation, including text mining methods, NLP model design
principles, and machine learning algorithms, are introduced. Second, the
process of text data preprocessing and feature extraction is described.
Finally, the effectiveness and predictive performance of the model are
validated through empirical research. The results show that the NLP-based
financial risk detection model performs strongly in risk identification and
prediction, providing effective risk management tools for financial
institutions. This study offers a valuable reference for the field of financial
risk management, utilizing advanced NLP techniques to improve the accuracy and
efficiency of financial risk detection.
☆ Bootstrapping Language Models with DPO Implicit Rewards
Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin
Human alignment in large language models (LLMs) is an active area of
research. A recent groundbreaking work, direct preference optimization (DPO),
has greatly simplified the process from past work in reinforcement learning
from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO,
after training, provides an implicit reward model. In this work, we make a
novel observation that this implicit reward model can by itself be used in a
bootstrapping fashion to further align the LLM. Our approach is to use the
rewards from a current LLM model to construct a preference dataset, which is
then used in subsequent DPO rounds. We incorporate refinements that debias the
length of the responses and improve the quality of the preference dataset to
further improve our approach. Our method, named self-alignment with DPO
ImpliCit rEwards (DICE), shows great improvements in alignment, achieving
performance superior to Gemini Pro on AlpacaEval 2 and reaching a 27.55%
length-controlled win rate against GPT-4 Turbo, but with only 8B parameters and
no external feedback. Our code is available at https://github.com/sail-sg/dice.
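The implicit reward DICE bootstraps from is DPO's standard quantity: beta times the log-ratio of the tuned policy's sequence likelihood to the reference model's likelihood. The sketch below uses hypothetical log-probabilities as placeholders to show how such rewards can rank responses into a preference pair:

```python
# DPO's implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x)).
# The log-probabilities below are illustrative placeholders, not model outputs.
def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    return beta * (logp_policy - logp_ref)

# Hypothetical summed token log-probs (policy, reference) for two responses.
candidates = {"a": (-12.3, -14.0), "b": (-15.1, -14.8)}
rewards = {k: dpo_implicit_reward(p, r) for k, (p, r) in candidates.items()}

# Higher implicit reward becomes "chosen", lower "rejected" for the next DPO round.
chosen = max(rewards, key=rewards.get)
rejected = min(rewards, key=rewards.get)
print(chosen, rejected)  # a b
```

Repeating this ranking over sampled responses yields the preference dataset used in subsequent DPO rounds; the paper's length-debiasing refinements are omitted here.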
☆ Self-Knowledge Distillation for Learning Ambiguity
Recent language models have shown remarkable performance on natural language
understanding (NLU) tasks. However, they are often sub-optimal when faced with
ambiguous samples that can be interpreted in multiple ways, over-confidently
predicting a single label without consideration for its correctness. To address
this issue, we propose a novel self-knowledge distillation method that enables
models to learn label distributions more accurately by leveraging knowledge
distilled from their lower layers. This approach also includes a learning phase
that re-calibrates the unnecessarily strengthened confidence for training
samples judged as extremely ambiguous based on the distilled distribution
knowledge. We validate our method on diverse NLU benchmark datasets and the
experimental results demonstrate its effectiveness in producing better label
distributions. Particularly, through the process of re-calibrating the
confidence for highly ambiguous samples, the issue of over-confidence when
predictions for unseen samples do not match their ground-truth labels is
significantly alleviated, contributing to better label distributions than
those of the existing state-of-the-art method. Moreover, our
method is more efficient in training the models compared to the existing
method, as it does not involve additional training processes to refine label
distributions.
comment: 9 pages, 5 figures
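A minimal sketch of the distillation signal: a softened prediction from a lower layer serves as a target distribution, and a KL-divergence term penalizes an over-confident final prediction. The layer choice, softening, and the confidence re-calibration phase described above are elided, and all numbers are illustrative:

```python
import math

def kl(p, q):
    """KL divergence between two discrete probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical softened predictions for one ambiguous training sample.
lower = [0.45, 0.40, 0.15]   # a lower layer sees the sample as ambiguous
final = [0.90, 0.07, 0.03]   # the final layer is over-confident
loss = kl(lower, final)      # large when the final layer over-commits
print(round(loss, 3))
```

When the final layer's distribution matches the lower layer's, the term vanishes, so the gradient only pushes back against confidence the lower layers do not share.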
☆ UniBridge: A Unified Approach to Cross-Lingual Transfer Learning for Low-Resource Languages
In this paper, we introduce UniBridge (Cross-Lingual Transfer Learning with
Optimized Embeddings and Vocabulary), a comprehensive approach developed to
improve the effectiveness of Cross-Lingual Transfer Learning, particularly in
languages with limited resources. Our approach tackles two essential elements
of a language model: the initialization of embeddings and the optimal
vocabulary size. Specifically, we propose a novel embedding initialization
method that leverages both lexical and semantic alignment for a language. In
addition, we present a method for systematically searching for the optimal
vocabulary size, ensuring a balance between model complexity and linguistic
coverage. Our experiments across multilingual datasets show that our approach
greatly improves the F1-Score in several languages. UniBridge is a robust and
adaptable solution for cross-lingual systems in various languages, highlighting
the significance of initializing embeddings and choosing the right vocabulary
size in cross-lingual environments.
comment: 16 pages
☆ Detecting Response Generation Not Requiring Factual Judgment
With the remarkable development of large language models (LLMs), ensuring the
factuality of output has become a challenge. However, grounding every part of
a response in given knowledge or facts is not necessarily desirable in
dialogue. This study aims to achieve both attractiveness and factuality in
dialogue responses by setting a task of predicting sentences that do not
require factual-correctness judgment, such as agreement or personal
opinions/feelings. We created a dataset for this task via crowdsourcing, a
dialogue dataset annotated with fact-check-needed labels (DDFC), and performed
classification tasks on several models using this dataset. The
best-performing model yielded classification results with about 88% accuracy.
☆ FreeCtrl: Constructing Control Centers with Feedforward Layers for Learning-Free Controllable Text Generation ACL 2024
Controllable text generation (CTG) seeks to craft texts adhering to specific
attributes, traditionally employing learning-based techniques such as training,
fine-tuning, or prefix-tuning with attribute-specific datasets. These
approaches, while effective, demand extensive computational and data resources.
In contrast, some proposed learning-free alternatives circumvent learning but
often yield inferior results, exemplifying the fundamental machine learning
trade-off between computational expense and model efficacy. To overcome these
limitations, we propose FreeCtrl, a learning-free approach that dynamically
adjusts the weights of selected feedforward neural network (FFN) vectors to
steer the outputs of large language models (LLMs). FreeCtrl hinges on the
principle that the weights of different FFN vectors influence the likelihood of
different tokens appearing in the output. By identifying and adaptively
adjusting the weights of attribute-related FFN vectors, FreeCtrl can control
the output likelihood of attribute keywords in the generated content. Extensive
experiments on single- and multi-attribute control reveal that the
learning-free FreeCtrl outperforms other learning-free and learning-based
methods, successfully resolving the dilemma between learning costs and model
performance.
comment: ACL 2024
☆ Optimizing Byte-level Representation for End-to-end ASR
We propose a novel approach to optimizing a byte-level representation for
end-to-end automatic speech recognition (ASR). Byte-level representation is
often used by large scale multilingual ASR systems when the character set of
the supported languages is large. The compactness and universality of
byte-level representation allow the ASR models to use smaller output
vocabularies and therefore, provide more flexibility. UTF-8 is a commonly used
byte-level representation for multilingual ASR, but it is not designed to
optimize machine learning tasks directly. By using an auto-encoder and vector
quantization, we show that we can optimize a byte-level representation for ASR
and achieve better accuracy. Our proposed framework can incorporate information
from different modalities, and provides an error correction mechanism. In an
English/Mandarin dictation task, we show that a bilingual ASR model built with
this approach can outperform UTF-8 representation by 5% relative in error rate.
comment: 5 pages, 1 figure
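For context, the UTF-8 baseline that this work improves upon behaves as follows: every string, regardless of script, maps to IDs drawn from a fixed 256-symbol byte vocabulary, which is what makes byte-level output layers compact and universal. The strings below are arbitrary examples:

```python
# UTF-8 byte-level tokenization: any text becomes a sequence of byte IDs
# from a fixed 256-symbol vocabulary, regardless of script.
def to_bytes(text):
    return list(text.encode("utf-8"))

en = to_bytes("hi")
zh = to_bytes("你好")
print(en)  # [104, 105] - each ASCII character is one byte
print(zh)  # [228, 189, 160, 229, 165, 189] - each CJK character is three bytes
assert all(0 <= b < 256 for b in en + zh)
```

The multi-byte encoding of non-Latin scripts also shows why a fixed UTF-8 mapping is not optimized for the ASR task itself, motivating the learned representation proposed here.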
☆ Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam
The recent integration of visual capabilities into Large Language Models
(LLMs) has the potential to play a pivotal role in science and technology
education, where visual elements such as diagrams, charts, and tables are
commonly used to improve the learning experience. This study investigates the
performance of ChatGPT-4 Vision, OpenAI's most advanced visual model at the
time the study was conducted, on the Bachelor in Computer Science section of
Brazil's 2021 National Undergraduate Exam (ENADE). By presenting the model with
the exam's open and multiple-choice questions in their original image format
and allowing for reassessment in response to differing answer keys, we were
able to evaluate the model's reasoning and self-reflecting capabilities in a
large-scale academic assessment involving textual and visual content. ChatGPT-4
Vision significantly outperformed the average exam participant, positioning
itself within the top 10 best score percentile. While it excelled in questions
that incorporated visual elements, it also encountered challenges with question
interpretation, logical reasoning, and visual acuity. The involvement of an
independent expert panel to review cases of disagreement between the model and
the answer key revealed some poorly constructed questions containing vague or
ambiguous statements, calling attention to the critical need for improved
question design in future exams. Our findings suggest that while ChatGPT-4
Vision shows promise in multimodal academic evaluations, human oversight
remains crucial for verifying the model's accuracy and ensuring the fairness of
high-stakes educational exams. The paper's research materials are publicly
available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
comment: Accepted for publication
☆ Learning Language Structures through Grounding
Language is highly structured, with syntactic and semantic structures, to
some extent, agreed upon by speakers of the same language. With implicit or
explicit awareness of such structures, humans can learn and use language
efficiently and generalize to sentences that contain unseen words. Motivated by
human language learning, in this dissertation, we consider a family of machine
learning tasks that aim to learn language structures through grounding. We seek
distant supervision from other data sources (i.e., grounds), including but not
limited to other modalities (e.g., vision), execution results of programs, and
other languages.
We demonstrate the potential of this task formulation and advocate for its
adoption through three schemes. In Part I, we consider learning syntactic
parses through visual grounding. We propose the task of visually grounded
grammar induction, present the first models to induce syntactic structures from
visually grounded text and speech, and find that the visual grounding signals
can help improve the parsing quality over language-only models. As a side
contribution, we propose a novel evaluation metric that enables the evaluation
of speech parsing without text or automatic speech recognition systems
involved. In Part II, we propose two execution-aware methods to map sentences
into corresponding semantic structures (i.e., programs), significantly
improving compositional generalization and few-shot program synthesis. In Part
III, we propose methods that learn language structures from annotations in
other languages. Specifically, we propose a method that sets a new state of the
art on cross-lingual word alignment. We then leverage the learned word
alignments to improve the performance of zero-shot cross-lingual dependency
parsing, by proposing a novel substructure-based projection method that
preserves structural knowledge learned from the source language.
comment: Ph.D. Thesis
♻ ☆ COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning
We present a cost-effective method to integrate speech into a large language
model (LLM), resulting in a Contextual Speech Model with
Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal
LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA)
pairs from speech transcriptions for supervised instruction tuning. With under
30 million trainable parameters and only 450 hours of English speech data,
COSMIC demonstrates emerging capabilities in instruction-following and
in-context learning. Equipped with such capabilities, COSMIC achieves a maximum
33.18 BLEU score in 0-shot EN-to-X speech to text translation (S2TT) and a
significant boost in the 1-shot setting. Additionally, there is an average
25.8\% relative Word Error Rate (WER) reduction for 1-shot cross-domain
adaptation. COSMIC exhibits a significant automatic speech recognition (ASR)
accuracy gain in contextual biasing tasks due to its instruction-following
capability.
♻ ☆ CausalChaos! Dataset for Comprehensive Causal Action Question Answering Over Longer Causal Chains Grounded in Dynamic Visual Scenes
Causal video question answering (QA) has garnered increasing interest, yet
existing datasets often lack depth in causal reasoning. To address this gap, we
capitalize on the unique properties of cartoons and construct CausalChaos!, a
novel, challenging causal Why-QA dataset built upon the iconic "Tom and Jerry"
cartoon series. Cartoons use the principles of animation that allow animators
to create expressive, unambiguous causal relationships between events to form a
coherent storyline. Utilizing these properties, along with thought-provoking
questions and multi-level answers (answer and detailed causal explanation), our
questions involve causal chains that interconnect multiple dynamic interactions
between characters and visual scenes. These factors demand models to solve more
challenging, yet well-defined causal relationships. We also introduce hard
incorrect answer mining, including a causally confusing version that is even
more challenging. While models perform well, there is much room for
improvement, especially on open-ended answers. We identify more
advanced/explicit causal relationship modeling and joint modeling of vision
and language as the immediate areas for future efforts to focus on. Along with
the other complementary datasets, our new challenging dataset will pave the way
for these developments in the field.
comment: Project Page: https://github.com/LUNAProject22/CausalChaos
♻ ☆ Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights
Severe data imbalance naturally exists among web-scale vision-language
datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable
robustness to the data imbalance compared to supervised learning, and
demonstrates significant effectiveness in learning generalizable
representations. With an aim to investigate the reasons behind this finding, we
conduct controlled experiments to study various underlying factors, and reveal
that CLIP's pretext task forms a dynamic classification problem wherein only a
subset of classes is present in training. This isolates the bias from dominant
classes and implicitly balances the learning signal. Furthermore, the
robustness and discriminability of CLIP improve with more descriptive language
supervision, larger data scale, and broader open-world concepts, which are
inaccessible to supervised learning. Our study not only uncovers the mechanisms
behind CLIP's generalizability beyond data imbalance but also provides
transferable insights for the research community. The findings are validated in
both supervised and self-supervised learning, enabling models trained on
imbalanced data to achieve CLIP-level performance on diverse recognition tasks.
Code and data are available at: https://github.com/CVMI-Lab/clip-beyond-tail.
♻ ☆ Towards the TopMost: A Topic Modeling System Toolkit ACL 2024
Topic models have a rich history with various applications and have recently
been reinvigorated by neural topic modeling. However, these numerous topic
models adopt totally distinct datasets, implementations, and evaluations. This
impedes quick utilization and fair comparisons, and thereby hinders their
research progress and applications. To tackle this challenge, we in this paper
propose a Topic Modeling System Toolkit (TopMost). Compared to existing
toolkits, TopMost stands out by supporting more extensive features. It covers a
broader spectrum of topic modeling scenarios with their complete lifecycles,
including datasets, preprocessing, models, training, and evaluations. Thanks to
its highly cohesive and decoupled modular design, TopMost enables rapid
utilization, fair comparisons, and flexible extensions of diverse cutting-edge
topic models. Our code, tutorials, and documentation are available at
https://github.com/bobxwu/topmost.
comment: Accepted to ACL 2024 System Demonstrations Track
♻ ☆ ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models
Performance prediction is a method to estimate the performance of Language
Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating
computational costs associated with model capacity and data for fine-tuning.
Our paper introduces ProxyLM, a scalable framework for predicting LM
performance using proxy models in multilingual tasks. These proxy models act as
surrogates, approximating the performance of the LM of interest. By leveraging
proxy models, ProxyLM significantly reduces computational overhead on task
evaluations, achieving up to a 37.08x speedup compared to traditional methods,
even with our smallest proxy models. Additionally, our methodology showcases
adaptability to previously unseen languages in pre-trained LMs, outperforming
the previous state of the art by 1.89x as measured by root-mean-square error
(RMSE). This framework streamlines model selection, enabling efficient
deployment and iterative LM enhancements without extensive computational
resources.
comment: Preprint
♻ ☆ EUROPA: A Legal Multilingual Keyphrase Generation Dataset ACL 2024
Keyphrase generation has primarily been explored within the context of
academic research articles, with a particular focus on scientific domains and
the English language. In this work, we present EUROPA, a dataset for
multilingual keyphrase generation in the legal domain. It is derived from legal
judgments from the Court of Justice of the European Union (EU), and contains
instances in all 24 EU official languages. We run multilingual models on our
corpus and analyze the results, showing room for improvement on a
domain-specific multilingual corpus such as the one we present.
comment: 19 pages, 2 figures, accepted at ACL 2024
♻ ☆ Towards Robust Instruction Tuning on Multimodal Large Language Models
Fine-tuning large language models (LLMs) on multi-task instruction-following
data has been proven to be a powerful learning paradigm for improving their
zero-shot capabilities on new tasks. Recent works about high-quality
instruction-following data generation and selection require amounts of human
labor to conceive model-understandable instructions for the given tasks and
carefully filter the LLM-generated data. In this work, we introduce an
automatic instruction augmentation method named INSTRAUG in multimodal tasks.
It starts from a handful of basic and straightforward meta instructions but can
expand an instruction-following dataset by 30 times. Results on two popular
multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP,
show that INSTRAUG can significantly improve the alignment of multimodal large
language models (MLLMs) across 12 multimodal tasks, matching the benefits of
scaling up the training data multiple times.
comment: 24 pages, 7 figures
♻ ☆ FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
We introduce FinTral, a suite of state-of-the-art multimodal large language
models (LLMs) built upon the Mistral-7b model and tailored for financial
analysis. FinTral integrates textual, numerical, tabular, and image data. We
enhance FinTral with domain-specific pretraining, instruction fine-tuning, and
RLAIF training by exploiting a large collection of textual and visual datasets
we curate for this work. We also introduce an extensive benchmark featuring
nine tasks and 25 datasets for evaluation, including hallucinations in the
financial domain. Our FinTral model trained with direct preference optimization
employing advanced Tools and Retrieval methods, dubbed FinTral-DPO-T&R,
demonstrates an exceptional zero-shot performance. It outperforms ChatGPT-3.5
in all tasks and surpasses GPT-4 in five out of nine tasks, marking a
significant advancement in AI-driven financial technology. We also demonstrate
that FinTral has the potential to excel in real-time analysis and
decision-making in diverse financial contexts. The GitHub repository for
FinTral is available at \url{https://github.com/UBC-NLP/fintral}.
♻ ☆ A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models
As one of the most advanced techniques in AI, Retrieval-Augmented Generation
(RAG) can offer reliable and up-to-date external knowledge, providing huge
convenience for numerous tasks. Particularly in the era of AI-Generated Content
(AIGC), the powerful capacity of retrieval in providing additional knowledge
enables RAG to assist existing generative AI in producing high-quality outputs.
Recently, Large Language Models (LLMs) have demonstrated revolutionary
abilities in language understanding and generation, while still facing inherent
limitations, such as hallucinations and out-of-date internal knowledge. Given
the powerful abilities of RAG in providing the latest and helpful auxiliary
information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged
to harness external and authoritative knowledge bases, rather than solely
relying on the model's internal knowledge, to augment the generation quality of
LLMs. In this survey, we comprehensively review existing research studies in
RA-LLMs, covering three primary technical perspectives: architectures, training
strategies, and applications. As the preliminary knowledge, we briefly
introduce the foundations and recent advances of LLMs. Then, to illustrate the
practical significance of RAG for LLMs, we systematically review mainstream
relevant work by their architectures, training strategies, and application
areas, detailing specifically the challenges of each and the corresponding
capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss
current limitations and several promising directions for future research.
Updated information about this survey can be found at
https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
♻ ☆ Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis NAACL 2024
Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li
Large language models (LLMs) have demonstrated remarkable potential in
handling multilingual machine translation (MMT). In this paper, we
systematically investigate the advantages and challenges of LLMs for MMT by
answering two questions: 1) How well do LLMs perform in translating massive
languages? 2) Which factors affect LLMs' performance in translation? We
thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4. Our
empirical results show that the translation capabilities of LLMs are
continually improving. GPT-4 has beaten the strong supervised baseline NLLB in
40.91% of translation directions but still faces a large gap to commercial
translation systems such as Google Translate, especially on low-resource languages.
Through further analysis, we discover that LLMs exhibit new working patterns
when used for MMT. First, LLMs can acquire translation ability in a
resource-efficient way and generate moderate translations even for zero-resource
languages. Second, instruction semantics can surprisingly be ignored when given
in-context exemplars. Third, cross-lingual exemplars can provide better task
guidance for low-resource translation than exemplars in the same language
pairs. Code will be released at: https://github.com/NJUNLP/MMT-LLM.
comment: Accepted to Findings of NAACL 2024
♻ ☆ LimGen: Probing the LLMs for Generating Suggestive Limitations of Research Papers ECML-PKDD 2024
Examining limitations is a crucial step in the scholarly research reviewing
process, revealing aspects where a study might lack decisiveness or require
enhancement. This aids readers in considering broader implications for further
research. In this article, we present a novel and challenging task of
Suggestive Limitation Generation (SLG) for research papers. We compile a
dataset called \textbf{\textit{LimGen}}, encompassing 4068 research papers and
their associated limitations from the ACL anthology. We investigate several
approaches to harness large language models (LLMs) for producing suggestive
limitations, by thoroughly examining the related challenges, practical
insights, and potential opportunities. Our LimGen dataset and code can be
accessed at \url{https://github.com/arbmf/LimGen}.
comment: Accepted at ECML-PKDD 2024
♻ ☆ FinDABench: Benchmarking Financial Data Analysis Ability of Large Language Models
Shu Liu, Shangqing Zhao, Chenghao Jia, Xinlin Zhuang, Zhaoguang Long, Jie Zhou, Aimin Zhou, Man Lan, Qingquan Wu, Chong Yang
Large Language Models (LLMs) have demonstrated impressive capabilities across
a wide range of tasks. However, their proficiency and reliability in the
specialized domain of financial data analysis, particularly focusing on
data-driven thinking, remain uncertain. To bridge this gap, we introduce
\texttt{FinDABench}, a comprehensive benchmark designed to evaluate the
financial data analysis capabilities of LLMs within this context.
\texttt{FinDABench} assesses LLMs across three dimensions: 1)
\textbf{Foundational Ability}, evaluating the models' ability to perform
financial numerical calculation and corporate sentiment risk assessment; 2)
\textbf{Reasoning Ability}, determining the models' ability to quickly
comprehend textual information and analyze abnormal financial reports; and 3)
\textbf{Technical Skill}, examining the models' use of technical knowledge to
address real-world data analysis challenges involving analysis generation and
chart visualization from multiple perspectives. We will release
\texttt{FinDABench} and the evaluation scripts at
\url{https://github.com/cubenlp/BIBench}. \texttt{FinDABench} aims to provide a
measure for in-depth analysis of LLM abilities and foster the advancement of
LLMs in the field of financial data analysis.
♻ ☆ Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, Paul Voigtlaender, Rohan Jain, Gabriela Surita, Kareem Mohamed, Rory Blevins, Junwhan Ahn, Tao Zhu, Kornraphop Kawintiranon, Orhan Firat, Yiming Gu, Yujing Zhang, Matthew Rahtz, Manaal Faruqui, Natalie Clay, Justin Gilmer, JD Co-Reyes, Ivo Penchev, Rui Zhu, Nobuyuki Morioka, Kevin Hui, Krishna Haridasan, Victor Campos, Mahdis Mahdieh, Mandy Guo, Samer Hassan, Kevin Kilgour, Arpi Vezer, Heng-Tze Cheng, Raoul de Liedekerke, Siddharth Goyal, Paul Barham, DJ Strouse, Seb Noury, Jonas Adler, Mukund Sundararajan, Sharad Vikram, Dmitry Lepikhin, Michela Paganini, Xavier Garcia, Fan Yang, Dasha Valter, Maja Trebacz, Kiran Vodrahalli, Chulayuth Asawaroengchai, Roman Ring, Norbert Kalb, Livio Baldini Soares, Siddhartha Brahma, David Steiner, Tianhe Yu, Fabian Mentzer, Antoine He, Lucas Gonzalez, Bibo Xu, Raphael Lopez Kaufman, Laurent El Shafey, Junhyuk Oh, Tom Hennigan, George van den Driessche, Seth Odoom, Mario Lucic, Becca Roelofs, Sid Lall, Amit Marathe, Betty Chan, Santiago Ontanon, Luheng He, Denis Teplyashin, Jonathan Lai, Phil Crone, Bogdan Damoc, Lewis Ho, Sebastian Riedel, Karel Lenc, Chih-Kuan Yeh, Aakanksha Chowdhery, Yang Xu, Mehran Kazemi, Ehsan Amid, Anastasia Petrushkina, Kevin Swersky, Ali Khodaei, Gowoon Chen, Chris Larkin, Mario Pinto, Geng Yan, Adria Puigdomenech Badia, Piyush Patil, Steven Hansen, Dave Orr, Sebastien M. R. 
Arnold, Jordan Grimstad, Andrew Dai, Sholto Douglas, Rishika Sinha, Vikas Yadav, Xi Chen, Elena Gribovskaya, Jacob Austin, Jeffrey Zhao, Kaushal Patel, Paul Komarek, Sophia Austin, Sebastian Borgeaud, Linda Friso, Abhimanyu Goyal, Ben Caine, Kris Cao, Da-Woon Chung, Matthew Lamm, Gabe Barth-Maron, Thais Kagohara, Kate Olszewska, Mia Chen, Kaushik Shivakumar, Rishabh Agarwal, Harshal Godhia, Ravi Rajwar, Javier Snaider, Xerxes Dotiwalla, Yuan Liu, Aditya Barua, Victor Ungureanu, Yuan Zhang, Bat-Orgil Batsaikhan, Mateo Wirth, James Qin, Ivo Danihelka, Tulsee Doshi, Martin Chadwick, Jilin Chen, Sanil Jain, Quoc Le, Arjun Kar, Madhu Gurumurthy, Cheng Li, Ruoxin Sang, Fangyu Liu, Lampros Lamprou, Rich Munoz, Nathan Lintz, Harsh Mehta, Heidi Howard, Malcolm Reynolds, Lora Aroyo, Quan Wang, Lorenzo Blanco, Albin Cassirer, Jordan Griffith, Dipanjan Das, Stephan Lee, Jakub Sygnowski, Zach Fisher, James Besley, Richard Powell, Zafarali Ahmed, Dominik Paulus, David Reitter, Zalan Borsos, Rishabh Joshi, Aedan Pope, Steven Hand, Vittorio Selo, Vihan Jain, Nikhil Sethi, Megha Goel, Takaki Makino, Rhys May, Zhen Yang, Johan Schalkwyk, Christina Butterfield, Anja Hauth, Alex Goldin, Will Hawkins, Evan Senter, Sergey Brin, Oliver Woodman, Marvin Ritter, Eric Noland, Minh Giang, Vijay Bolina, Lisa Lee, Tim Blyth, Ian Mackinnon, Machel Reid, Obaid Sarvana, David Silver, Alexander Chen, Lily Wang, Loren Maggiore, Oscar Chang, Nithya Attaluri, Gregory Thornton, Chung-Cheng Chiu, Oskar Bunyan, Nir Levine, Timothy Chung, Evgenii Eltyshev, Xiance Si, Timothy Lillicrap, Demetra Brady, Vaibhav Aggarwal, Boxi Wu, Yuanzhong Xu, Ross McIlroy, Kartikeya Badola, Paramjit Sandhu, Erica Moreira, Wojciech Stokowiec, Ross Hemsley, Dong Li, Alex Tudor, Pranav Shyam, Elahe Rahimtoroghi, Salem Haykal, Pablo Sprechmann, Xiang Zhou, Diana Mincu, Yujia Li, Ravi Addanki, Kalpesh Krishna, Xiao Wu, Alexandre Frechette, Matan Eyal, Allan Dafoe, Dave Lacey, Jay Whang, Thi Avrahami, Ye Zhang, Emanuel Taropa, 
Hanzhao Lin, Daniel Toyama, Eliza Rutherford, Motoki Sano, HyunJeong Choe, Alex Tomala, Chalence Safranek-Shrader, Nora Kassner, Mantas Pajarskas, Matt Harvey, Sean Sechrist, Meire Fortunato, Christina Lyu, Gamaleldin Elsayed, Chenkai Kuang, James Lottes, Eric Chu, Chao Jia, Chih-Wei Chen, Peter Humphreys, Kate Baumli, Connie Tao, Rajkumar Samuel, Cicero Nogueira dos Santos, Anders Andreassen, Nemanja Rakićević, Dominik Grewe, Aviral Kumar, Stephanie Winkler, Jonathan Caton, Andrew Brock, Sid Dalmia, Hannah Sheahan, Iain Barr, Yingjie Miao, Paul Natsev, Jacob Devlin, Feryal Behbahani, Flavien Prost, Yanhua Sun, Artiom Myaskovsky, Thanumalayan Sankaranarayana Pillai, Dan Hurt, Angeliki Lazaridou, Xi Xiong, Ce Zheng, Fabio Pardo, Xiaowei Li, Dan Horgan, Joe Stanton, Moran Ambar, Fei Xia, Alejandro Lince, Mingqiu Wang, Basil Mustafa, Albert Webson, Hyo Lee, Rohan Anil, Martin Wicke, Timothy Dozat, Abhishek Sinha, Enrique Piqueras, Elahe Dabir, Shyam Upadhyay, Anudhyan Boral, Lisa Anne Hendricks, Corey Fry, Josip Djolonga, Yi Su, Jake Walker, Jane Labanowski, Ronny Huang, Vedant Misra, Jeremy Chen, RJ Skerry-Ryan, Avi Singh, Shruti Rijhwani, Dian Yu, Alex Castro-Ros, Beer Changpinyo, Romina Datta, Sumit Bagri, Arnar Mar Hrafnkelsson, Marcello Maggioni, Daniel Zheng, Yury Sulsky, Shaobo Hou, Tom Le Paine, Antoine Yang, Jason Riesa, Dominika Rogozinska, Dror Marcus, Dalia El Badawy, Qiao Zhang, Luyu Wang, Helen Miller, Jeremy Greer, Lars Lowe Sjos, Azade Nova, Heiga Zen, Rahma Chaabouni, Mihaela Rosca, Jiepu Jiang, Charlie Chen, Ruibo Liu, Tara Sainath, Maxim Krikun, Alex Polozov, Jean-Baptiste Lespiau, Josh Newlan, Zeyncep Cankara, Soo Kwak, Yunhan Xu, Phil Chen, Andy Coenen, Clemens Meyer, Katerina Tsihlas, Ada Ma, Juraj Gottweis, Jinwei Xing, Chenjie Gu, Jin Miao, Christian Frank, Zeynep Cankara, Sanjay Ganapathy, Ishita Dasgupta, Steph Hughes-Fitt, Heng Chen, David Reid, Keran Rong, Hongmin Fan, Joost van Amersfoort, Vincent Zhuang, Aaron Cohen, Shixiang Shane Gu, 
Anhad Mohananey, Anastasija Ilic, Taylor Tobin, John Wieting, Anna Bortsova, Phoebe Thacker, Emma Wang, Emily Caveness, Justin Chiu, Eren Sezener, Alex Kaskasoli, Steven Baker, Katie Millican, Mohamed Elhawaty, Kostas Aisopos, Carl Lebsack, Nathan Byrd, Hanjun Dai, Wenhao Jia, Matthew Wiethoff, Elnaz Davoodi, Albert Weston, Lakshman Yagati, Arun Ahuja, Isabel Gao, Golan Pundak, Susan Zhang, Michael Azzam, Khe Chai Sim, Sergi Caelles, James Keeling, Abhanshu Sharma, Andy Swing, YaGuang Li, Chenxi Liu, Carrie Grimes Bostock, Yamini Bansal, Zachary Nado, Ankesh Anand, Josh Lipschultz, Abhijit Karmarkar, Lev Proleev, Abe Ittycheriah, Soheil Hassas Yeganeh, George Polovets, Aleksandra Faust, Jiao Sun, Alban Rrustemi, Pen Li, Rakesh Shivanna, Jeremiah Liu, Chris Welty, Federico Lebron, Anirudh Baddepudi, Sebastian Krause, Emilio Parisotto, Radu Soricut, Zheng Xu, Dawn Bloxwich, Melvin Johnson, Behnam Neyshabur, Justin Mao-Jones, Renshen Wang, Vinay Ramasesh, Zaheer Abbas, Arthur Guez, Constant Segal, Duc Dung Nguyen, James Svensson, Le Hou, Sarah York, Kieran Milan, Sophie Bridgers, Wiktor Gworek, Marco Tagliasacchi, James Lee-Thorp, Michael Chang, Alexey Guseynov, Ale Jakse Hartman, Michael Kwong, Ruizhe Zhao, Sheleem Kashem, Elizabeth Cole, Antoine Miech, Richard Tanburn, Mary Phuong, Filip Pavetic, Sebastien Cevey, Ramona Comanescu, Richard Ives, Sherry Yang, Cosmo Du, Bo Li, Zizhao Zhang, Mariko Iinuma, Clara Huiyi Hu, Aurko Roy, Shaan Bijwadia, Zhenkai Zhu, Danilo Martins, Rachel Saputro, Anita Gergely, Steven Zheng, Dawei Jia, Ioannis Antonoglou, Adam Sadovsky, Shane Gu, Yingying Bi, Alek Andreev, Sina Samangooei, Mina Khan, Tomas Kocisky, Angelos Filos, Chintu Kumar, Colton Bishop, Adams Yu, Sarah Hodkinson, Sid Mittal, Premal Shah, Alexandre Moufarek, Yong Cheng, Adam Bloniarz, Jaehoon Lee, Pedram Pejman, Paul Michel, Stephen Spencer, Vladimir Feinberg, Xuehan Xiong, Nikolay Savinov, Charlotte Smith, Siamak Shakeri, Dustin Tran, Mary Chesus, Bernd Bohnet, George 
Tucker, Tamara von Glehn, Carrie Muir, Yiran Mao, Hideto Kazawa, Ambrose Slone, Kedar Soparkar, Disha Shrivastava, James Cobon-Kerr, Michael Sharman, Jay Pavagadhi, Carlos Araya, Karolis Misiunas, Nimesh Ghelani, Michael Laskin, David Barker, Qiujia Li, Anton Briukhov, Neil Houlsby, Mia Glaese, Balaji Lakshminarayanan, Nathan Schucher, Yunhao Tang, Eli Collins, Hyeontaek Lim, Fangxiaoyu Feng, Adria Recasens, Guangda Lai, Alberto Magni, Nicola De Cao, Aditya Siddhant, Zoe Ashwood, Jordi Orbay, Mostafa Dehghani, Jenny Brennan, Yifan He, Kelvin Xu, Yang Gao, Carl Saroufim, James Molloy, Xinyi Wu, Seb Arnold, Solomon Chang, Julian Schrittwieser, Elena Buchatskaya, Soroush Radpour, Martin Polacek, Skye Giordano, Ankur Bapna, Simon Tokumine, Vincent Hellendoorn, Thibault Sottiaux, Sarah Cogan, Aliaksei Severyn, Mohammad Saleh, Shantanu Thakoor, Laurent Shefey, Siyuan Qiao, Meenu Gaba, Shuo-yiin Chang, Craig Swanson, Biao Zhang, Benjamin Lee, Paul Kishan Rubenstein, Gan Song, Tom Kwiatkowski, Anna Koop, Ajay Kannan, David Kao, Parker Schuh, Axel Stjerngren, Golnaz Ghiasi, Gena Gibson, Luke Vilnis, Ye Yuan, Felipe Tiengo Ferreira, Aishwarya Kamath, Ted Klimenko, Ken Franko, Kefan Xiao, Indro Bhattacharya, Miteyan Patel, Rui Wang, Alex Morris, Robin Strudel, Vivek Sharma, Peter Choy, Sayed Hadi Hashemi, Jessica Landon, Mara Finkelstein, Priya Jhakra, Justin Frye, Megan Barnes, Matthew Mauger, Dennis Daun, Khuslen Baatarsukh, Matthew Tung, Wael Farhan, Henryk Michalewski, Fabio Viola, Felix de Chaumont Quitry, Charline Le Lan, Tom Hudson, Qingze Wang, Felix Fischer, Ivy Zheng, Elspeth White, Anca Dragan, Jean-baptiste Alayrac, Eric Ni, Alexander Pritzel, Adam Iwanicki, Michael Isard, Anna Bulanova, Lukas Zilka, Ethan Dyer, Devendra Sachan, Srivatsan Srinivasan, Hannah Muckenhirn, Honglong Cai, Amol Mandhane, Mukarram Tariq, Jack W. 
Rae, Gary Wang, Kareem Ayoub, Nicholas FitzGerald, Yao Zhao, Woohyun Han, Chris Alberti, Dan Garrette, Kashyap Krishnakumar, Mai Gimenez, Anselm Levskaya, Daniel Sohn, Josip Matak, Inaki Iturrate, Michael B. Chang, Jackie Xiang, Yuan Cao, Nishant Ranka, Geoff Brown, Adrian Hutter, Vahab Mirrokni, Nanxin Chen, Kaisheng Yao, Zoltan Egyed, Francois Galilee, Tyler Liechty, Praveen Kallakuri, Evan Palmer, Sanjay Ghemawat, Jasmine Liu, David Tao, Chloe Thornton, Tim Green, Mimi Jasarevic, Sharon Lin, Victor Cotruta, Yi-Xuan Tan, Noah Fiedel, Hongkun Yu, Ed Chi, Alexander Neitz, Jens Heitkaemper, Anu Sinha, Denny Zhou, Yi Sun, Charbel Kaed, Brice Hulse, Swaroop Mishra, Maria Georgaki, Sneha Kudugunta, Clement Farabet, Izhak Shafran, Daniel Vlasic, Anton Tsitsulin, Rajagopal Ananthanarayanan, Alen Carin, Guolong Su, Pei Sun, Shashank V, Gabriel Carvajal, Josef Broder, Iulia Comsa, Alena Repina, William Wong, Warren Weilun Chen, Peter Hawkins, Egor Filonov, Lucia Loher, Christoph Hirnschall, Weiyi Wang, Jingchen Ye, Andrea Burns, Hardie Cate, Diana Gage Wright, Federico Piccinini, Lei Zhang, Chu-Cheng Lin, Ionel Gog, Yana Kulizhskaya, Ashwin Sreevatsa, Shuang Song, Luis C. Cobo, Anand Iyer, Chetan Tekur, Guillermo Garrido, Zhuyun Xiao, Rupert Kemp, Huaixiu Steven Zheng, Hui Li, Ananth Agarwal, Christel Ngani, Kati Goshvadi, Rebeca Santamaria-Fernandez, Wojciech Fica, Xinyun Chen, Chris Gorgolewski, Sean Sun, Roopal Garg, Xinyu Ye, S. M. 
Ali Eslami, Nan Hua, Jon Simon, Pratik Joshi, Yelin Kim, Ian Tenney, Sahitya Potluri, Lam Nguyen Thiet, Quan Yuan, Florian Luisier, Alexandra Chronopoulou, Salvatore Scellato, Praveen Srinivasan, Minmin Chen, Vinod Koverkathu, Valentin Dalibard, Yaming Xu, Brennan Saeta, Keith Anderson, Thibault Sellam, Nick Fernando, Fantine Huot, Junehyuk Jung, Mani Varadarajan, Michael Quinn, Amit Raul, Maigo Le, Ruslan Habalov, Jon Clark, Komal Jalan, Kalesha Bullard, Achintya Singhal, Thang Luong, Boyu Wang, Sujeevan Rajayogam, Julian Eisenschlos, Johnson Jia, Daniel Finchelstein, Alex Yakubovich, Daniel Balle, Michael Fink, Sameer Agarwal, Jing Li, Dj Dvijotham, Shalini Pal, Kai Kang, Jaclyn Konzelmann, Jennifer Beattie, Olivier Dousse, Diane Wu, Remi Crocker, Chen Elkind, Siddhartha Reddy Jonnalagadda, Jong Lee, Dan Holtmann-Rice, Krystal Kallarackal, Rosanne Liu, Denis Vnukov, Neera Vats, Luca Invernizzi, Mohsen Jafari, Huanjie Zhou, Lilly Taylor, Jennifer Prendki, Marcus Wu, Tom Eccles, Tianqi Liu, Kavya Kopparapu, Francoise Beaufays, Christof Angermueller, Andreea Marzoca, Shourya Sarcar, Hilal Dib, Jeff Stanway, Frank Perbet, Nejc Trdin, Rachel Sterneck, Andrey Khorlin, Dinghua Li, Xihui Wu, Sonam Goenka, David Madras, Sasha Goldshtein, Willi Gierke, Tong Zhou, Yaxin Liu, Yannie Liang, Anais White, Yunjie Li, Shreya Singh, Sanaz Bahargam, Mark Epstein, Sujoy Basu, Li Lao, Adnan Ozturel, Carl Crous, Alex Zhai, Han Lu, Zora Tung, Neeraj Gaur, Alanna Walton, Lucas Dixon, Ming Zhang, Amir Globerson, Grant Uy, Andrew Bolt, Olivia Wiles, Milad Nasr, Ilia Shumailov, Marco Selvi, Francesco Piccinno, Ricardo Aguilar, Sara McCarthy, Misha Khalman, Mrinal Shukla, Vlado Galic, John Carpenter, Kevin Villela, Haibin Zhang, Harry Richardson, James Martens, Matko Bosnjak, Shreyas Rammohan Belle, Jeff Seibert, Mahmoud Alnahlawi, Brian McWilliams, Sankalp Singh, Annie Louis, Wen Ding, Dan Popovici, Lenin Simicich, Laura Knight, Pulkit Mehta, Nishesh Gupta, Chongyang Shi, Saaber Fatehi, 
Jovana Mitrovic, Alex Grills, Joseph Pagadora, Dessie Petrova, Danielle Eisenbud, Zhishuai Zhang, Damion Yates, Bhavishya Mittal, Nilesh Tripuraneni, Yannis Assael, Thomas Brovelli, Prateek Jain, Mihajlo Velimirovic, Canfer Akbulut, Jiaqi Mu, Wolfgang Macherey, Ravin Kumar, Jun Xu, Haroon Qureshi, Gheorghe Comanici, Jeremy Wiesner, Zhitao Gong, Anton Ruddock, Matthias Bauer, Nick Felt, Anirudh GP, Anurag Arnab, Dustin Zelle, Jonas Rothfuss, Bill Rosgen, Ashish Shenoy, Bryan Seybold, Xinjian Li, Jayaram Mudigonda, Goker Erdogan, Jiawei Xia, Jiri Simsa, Andrea Michi, Yi Yao, Christopher Yew, Steven Kan, Isaac Caswell, Carey Radebaugh, Andre Elisseeff, Pedro Valenzuela, Kay McKinney, Kim Paterson, Albert Cui, Eri Latorre-Chimoto, Solomon Kim, William Zeng, Ken Durden, Priya Ponnapalli, Tiberiu Sosea, Christopher A. Choquette-Choo, James Manyika, Brona Robenek, Harsha Vashisht, Sebastien Pereira, Hoi Lam, Marko Velic, Denese Owusu-Afriyie, Katherine Lee, Tolga Bolukbasi, Alicia Parrish, Shawn Lu, Jane Park, Balaji Venkatraman, Alice Talbert, Lambert Rosique, Yuchung Cheng, Andrei Sozanschi, Adam Paszke, Praveen Kumar, Jessica Austin, Lu Li, Khalid Salama, Wooyeol Kim, Nandita Dukkipati, Anthony Baryshnikov, Christos Kaplanis, XiangHai Sheng, Yuri Chervonyi, Caglar Unlu, Diego de Las Casas, Harry Askham, Kathryn Tunyasuvunakool, Felix Gimeno, Siim Poder, Chester Kwak, Matt Miecnikowski, Vahab Mirrokni, Alek Dimitriev, Aaron Parisi, Dangyi Liu, Tomy Tsai, Toby Shevlane, Christina Kouridi, Drew Garmon, Adrian Goedeckemeyer, Adam R. 
Brown, Anitha Vijayakumar, Ali Elqursh, Sadegh Jazayeri, Jin Huang, Sara Mc Carthy, Jay Hoover, Lucy Kim, Sandeep Kumar, Wei Chen, Courtney Biles, Garrett Bingham, Evan Rosen, Lisa Wang, Qijun Tan, David Engel, Francesco Pongetti, Dario de Cesare, Dongseong Hwang, Lily Yu, Jennifer Pullman, Srini Narayanan, Kyle Levin, Siddharth Gopal, Megan Li, Asaf Aharoni, Trieu Trinh, Jessica Lo, Norman Casagrande, Roopali Vij, Loic Matthey, Bramandia Ramadhana, Austin Matthews, CJ Carey, Matthew Johnson, Kremena Goranova, Rohin Shah, Shereen Ashraf, Kingshuk Dasgupta, Rasmus Larsen, Yicheng Wang, Manish Reddy Vuyyuru, Chong Jiang, Joana Ijazi, Kazuki Osawa, Celine Smith, Ramya Sree Boppana, Taylan Bilal, Yuma Koizumi, Ying Xu, Yasemin Altun, Nir Shabat, Ben Bariach, Alex Korchemniy, Kiam Choo, Olaf Ronneberger, Chimezie Iwuanyanwu, Shubin Zhao, David Soergel, Cho-Jui Hsieh, Irene Cai, Shariq Iqbal, Martin Sundermeyer, Zhe Chen, Elie Bursztein, Chaitanya Malaviya, Fadi Biadsy, Prakash Shroff, Inderjit Dhillon, Tejasi Latkar, Chris Dyer, Hannah Forbes, Massimo Nicosia, Vitaly Nikolaev, Somer Greene, Marin Georgiev, Pidong Wang, Nina Martin, Hanie Sedghi, John Zhang, Praseem Banzal, Doug Fritz, Vikram Rao, Xuezhi Wang, Jiageng Zhang, Viorica Patraucean, Dayou Du, Igor Mordatch, Ivan Jurin, Lewis Liu, Ayush Dubey, Abhi Mohan, Janek Nowakowski, Vlad-Doru Ion, Nan Wei, Reiko Tojo, Maria Abi Raad, Drew A. Hudson, Vaishakh Keshava, Shubham Agrawal, Kevin Ramirez, Zhichun Wu, Hoang Nguyen, Ji Liu, Madhavi Sewak, Bryce Petrini, DongHyun Choi, Ivan Philips, Ziyue Wang, Ioana Bica, Ankush Garg, Jarek Wilkiewicz, Priyanka Agrawal, Xiaowei Li, Danhao Guo, Emily Xue, Naseer Shaik, Andrew Leach, Sadh MNM Khan, Julia Wiesinger, Sammy Jerome, Abhishek Chakladar, Alek Wenjiao Wang, Tina Ornduff, Folake Abu, Alireza Ghaffarkhah, Marcus Wainwright, Mario Cortes, Frederick Liu, Joshua Maynez, Slav Petrov, Yonghui Wu, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, Oriol Vinyals
In this report, we introduce the Gemini 1.5 family of models, representing
the next generation of highly compute-efficient multimodal models capable of
recalling and reasoning over fine-grained information from millions of tokens
of context, including multiple long documents and hours of video and audio. The
family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds
the February version on the great majority of capabilities and benchmarks; (2)
Gemini 1.5 Flash, a more lightweight variant designed for efficiency with
minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on
long-context retrieval tasks across modalities, improve the state-of-the-art in
long-document QA, long-video QA and long-context ASR, and match or surpass
Gemini 1.0 Ultra's state-of-the-art performance across a broad set of
benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find
continued improvement in next-token prediction and near-perfect retrieval
(>99%) up to at least 10M tokens, a generational leap over existing models such
as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world
use cases, such as Gemini 1.5 collaborating with professionals on completing
their tasks, achieving 26% to 75% time savings across 10 different job
categories, as well as surprising new capabilities of large language models at
the frontier; when given a grammar manual for Kalamang, a language with fewer
than 200 speakers worldwide, the model learns to translate English to Kalamang
at a similar level to a person who learned from the same content.
♻ ☆ A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
Large Vision-Language Models (LVLMs), despite their recent success, are
hardly comprehensively tested for their cognitive abilities. Inspired by the
prevalent use of the "Cookie Theft" task in human cognition tests, we propose a
novel evaluation benchmark to evaluate high-level cognitive ability of LVLMs
using images with rich semantics. It defines eight reasoning capabilities and
consists of an image description task and a visual question answering task. Our
evaluation on well-known LVLMs shows that there is still a large gap in
cognitive ability between LVLMs and humans.
♻ ☆ Underneath the Numbers: Quantitative and Qualitative Gender Fairness in LLMs for Depression Prediction
Recent studies show bias in many machine learning models for depression
detection, but bias in LLMs for this task remains unexplored. This work
presents the first attempt to investigate the degree of gender bias present in
existing LLMs (ChatGPT, LLaMA 2, and Bard) using both quantitative and
qualitative approaches. From our quantitative evaluation, we found that ChatGPT
performs the best across various performance metrics and LLaMA 2 outperforms
other LLMs in terms of group fairness metrics. As qualitative fairness
evaluation remains an open research question, we propose several strategies
(e.g., word count, thematic analysis) to investigate whether and how a
qualitative evaluation can provide valuable insights for bias analysis beyond
what is possible with quantitative evaluation. We found that ChatGPT
consistently provides a more comprehensive, well-reasoned explanation for its
prediction compared to LLaMA 2. We have also identified several themes adopted
by LLMs to qualitatively evaluate gender fairness. We hope our results can be
used as a stepping stone towards future attempts at improving qualitative
evaluation of fairness for LLMs especially for high-stakes tasks such as
depression detection.
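One of the qualitative strategies the abstract names, word count, can be made concrete with a small sketch: compare how verbose a model's explanations are across gender groups for otherwise comparable cases. The example explanations below are purely illustrative, not drawn from the study.

```python
# Illustrative sketch of the word-count strategy for qualitative fairness
# analysis: measure the average explanation length per gender group.
# The explanation texts here are made up for demonstration.

def mean_word_count(explanations):
    """Average number of words across a list of explanation strings."""
    return sum(len(e.split()) for e in explanations) / len(explanations)

by_gender = {
    "female": ["The post mentions persistent sadness and sleep problems.",
               "Frequent hopeless language suggests elevated risk."],
    "male": ["Mentions of sadness.", "Possible risk."],
}
# A large gap between groups may signal unequal thoroughness of reasoning.
gap = {g: mean_word_count(es) for g, es in by_gender.items()}
```

A gap like this would then be probed further with thematic analysis rather than treated as conclusive on its own.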
♻ ☆ Empowering Character-level Text Infilling by Eliminating Sub-Tokens ACL 2024
In infilling tasks, sub-tokens, representing instances where a complete token
is segmented into two parts, often emerge at the boundaries of prefixes,
middles, and suffixes. Traditional methods focused on training models at the
token level, leading to sub-optimal performance in character-level infilling
tasks during the inference stage. Alternatively, some approaches considered
character-level infilling but still relied on predicting sub-tokens at
inference time, which diminished character-level infilling performance
because models assign high perplexity to sub-tokens. In this paper, we
introduce FIM-SE, which stands for Fill-In-the-Middle with both Starting and
Ending character constraints. The proposed method addresses character-level
infilling tasks by utilizing a line-level format to avoid predicting any
sub-token in inference. In addition, we incorporate two special tokens to
signify the rest of the incomplete lines, thereby enhancing generation
guidance. Extensive experiments demonstrate that our proposed approach
surpasses previous methods, offering a significant advantage. Code is available
at https://github.com/SenseLLM/FIM-SE.
comment: Accepted to ACL 2024 (main conference)
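The line-level reformatting FIM-SE describes can be sketched as follows: split the character-level prefix and suffix into complete lines plus an incomplete remainder, so the model only ever generates whole lines and never a sub-token. The special-token names and prompt layout here are illustrative assumptions, not the paper's exact vocabulary.

```python
# Sketch of line-level fill-in-the-middle prompting in the spirit of FIM-SE.
# Token names (<PREFIX>, <START>, <END>, <SUFFIX>, <MIDDLE>) are hypothetical.

def build_fim_se_prompt(prefix: str, suffix: str) -> str:
    """Separate complete lines from incomplete-line remainders at the
    prefix/suffix boundaries, so no sub-token prediction is needed."""
    # Complete lines of the prefix, and its trailing incomplete line.
    pre_head, _, pre_rest = prefix.rpartition("\n")
    # Leading incomplete line of the suffix, and its remaining complete lines.
    suf_rest, _, suf_tail = suffix.partition("\n")
    return (
        "<PREFIX>" + pre_head
        + "<START>" + pre_rest    # rest of the first (incomplete) line
        + "<END>" + suf_rest      # rest of the last (incomplete) line
        + "<SUFFIX>" + suf_tail
        + "<MIDDLE>"
    )

prompt = build_fim_se_prompt("def add(a, b):\n    ret",
                             "rn a + b\nprint(add(1, 2))")
```

The two constraint tokens signal which fragments of the incomplete first and last lines the generated middle must complete.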
♻ ☆ Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
We conducted a detailed analysis on the quality of web-mined corpora for two
low-resource languages (making three language pairs, English-Sinhala,
English-Tamil and Sinhala-Tamil). We ranked each corpus according to a
similarity measure and carried out an intrinsic and extrinsic evaluation on
different portions of this ranked corpus. We show that there are significant
quality differences between different portions of web-mined corpora and that
the quality varies across languages and datasets. We also show that, for some
web-mined datasets, Neural Machine Translation (NMT) models trained with their
highest-ranked 25k portion can be on par with human-curated datasets.
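The rank-then-filter procedure the abstract describes can be sketched in a few lines: score each sentence pair with a similarity measure and keep only the top-ranked portion. The scoring function below is a crude stand-in; the study uses a learned cross-lingual similarity measure.

```python
# Minimal sketch of corpus ranking and filtering. The similarity function
# here (length ratio) is only a toy proxy for a real quality score.

def top_ranked(pairs, score_fn, k):
    """Return the k highest-scoring (src, tgt) sentence pairs."""
    return sorted(pairs, key=lambda p: score_fn(*p), reverse=True)[:k]

def length_ratio(src, tgt):
    """Toy quality proxy: very unbalanced lengths often indicate noise."""
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

corpus = [("hello world", "x"),
          ("good morning", "bonjour tout le monde"),
          ("thank you", "merci bien")]
best = top_ranked(corpus, length_ratio, k=2)
```

In the study, the highest-ranked 25k pairs of some web-mined sets already rival human-curated data for NMT training.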
♻ ☆ M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark
Wei Song, Yadong Li, Jianhua Xu, Guowei Wu, Lingfeng Ming, Kexin Yi, Weihua Luo, Houyi Li, Yi Du, Fangda Guo, Kaicheng Yu
As recent multi-modality large language models (MLLMs) have shown formidable
proficiency on various complex tasks, there has been increasing attention on
debating whether these models could eventually mirror human intelligence.
However, existing benchmarks mainly focus on evaluating solely on task
performance, such as the accuracy of identifying the attribute of an object.
Combining well-developed cognitive science to understand the intelligence of
MLLMs beyond superficial achievements remains largely unexplored. To this end,
we introduce the first cognitive-driven multi-lingual and multi-modal benchmark
to evaluate the general intelligence ability of MLLMs, dubbed M3GIA.
Specifically, we identify five key cognitive factors based on the
well-recognized Cattell-Horn-Carroll (CHC) model of intelligence and propose a
novel evaluation metric. In addition, since most MLLMs are trained to perform
in different languages, a natural question arises: is language a key factor
influencing the cognitive ability of MLLMs? As such, we go beyond English to
encompass other languages based on their popularity, including Chinese, French,
Spanish, Portuguese and Korean, to construct our M3GIA. We make sure all the
data relevant to the cultural backgrounds are collected from their native
context to avoid English-centric bias. We collected a significant corpus of
data from human participants, revealing that the most advanced MLLM reaches the
lower boundary of human intelligence in English. Yet, there remains a
pronounced disparity in the other five languages assessed. We also reveal an
interesting winner-takes-all phenomenon that aligns with findings in
cognitive studies. Our benchmark will be open-sourced, with the aspiration of
facilitating the enhancement of cognitive capabilities in MLLMs.
♻ ☆ Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework
The acceleration of Large Language Models (LLMs) research has opened up new
possibilities for evaluating generated texts. LLMs serve as scalable and
economical evaluators, but how reliable these evaluators are has emerged as a
crucial research question. Prior research efforts in the
meta-evaluation of LLMs as judges limit the prompting of an LLM to a single use
to obtain a final evaluation decision. They then compute the agreement between
LLMs' outputs and human labels. This lacks interpretability in understanding
the evaluation capability of LLMs. In light of this challenge, we propose
Decompose and Aggregate, which breaks down the evaluation process into
different stages based on pedagogical practices. Our experiments illustrate
that it not only provides a more interpretable window for how well LLMs
evaluate, but also leads to improvements up to 39.6% for different LLMs on a
variety of meta-evaluation benchmarks.
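The aggregation half of the framework can be made concrete with a small sketch: score several criteria separately, then combine them into one verdict. The criteria, weights, and scores below are illustrative; in the paper each decomposed stage is itself handled by prompting an LLM.

```python
# Sketch of the decompose-then-aggregate idea: per-criterion scores are
# combined by a weighted average instead of asking for one opaque verdict.
# Criteria and weights here are hypothetical examples.

def aggregate(criterion_scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores in [0, 1]."""
    total = sum(weights.values())
    return sum(criterion_scores[c] * w for c, w in weights.items()) / total

scores = {"fluency": 0.9, "relevance": 0.6, "factuality": 0.8}
weights = {"fluency": 1.0, "relevance": 2.0, "factuality": 2.0}
overall = aggregate(scores, weights)  # (0.9 + 1.2 + 1.6) / 5 = 0.74
```

The per-criterion scores are what make the evaluation interpretable: a low overall score can be traced back to the stage that produced it.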
♻ ☆ TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models
Mainstream approaches to aligning large language models (LLMs) heavily rely
on human preference data, particularly when models require periodic updates.
The standard process for iterative alignment of LLMs involves collecting new
human feedback for each update. However, the data collection process is costly
and challenging to scale. To address this issue, we introduce the "TS-Align"
framework, which fine-tunes a policy model using pairwise feedback data
automatically mined from its outputs. This automatic mining process is
efficiently accomplished through the collaboration between a large-scale
teacher model and a small-scale student model. The policy fine-tuning process
can be iteratively repeated using on-policy generations within our proposed
teacher-student collaborative framework. Through extensive experiments, we
demonstrate that our final aligned policy outperforms the base policy model
with an average win rate of 69.7% across seven conversational or
instruction-following datasets. Furthermore, we show that the ranking
capability of the teacher is effectively distilled into the student through our
pipeline, resulting in a small-scale yet effective reward model for policy
model alignment.
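The core mining step of the framework can be sketched as follows: the policy proposes candidate responses, the teacher ranks them, and the best/worst pair becomes automatically mined pairwise feedback. Both models are stubbed here with toy stand-ins.

```python
# Sketch of teacher-scored preference-pair mining in the spirit of TS-Align.
# The teacher below is a toy stand-in for a real large-scale reward model.

def mine_preference_pair(prompt, candidates, teacher_score):
    """Return (chosen, rejected) from the teacher's ranking of candidates."""
    ranked = sorted(candidates, key=lambda c: teacher_score(prompt, c),
                    reverse=True)
    return ranked[0], ranked[-1]

# Toy teacher: prefers longer answers (illustrative only).
toy_teacher = lambda prompt, resp: len(resp)

chosen, rejected = mine_preference_pair(
    "Explain recursion.",
    ["It calls itself.",
     "A function that invokes itself on smaller inputs.",
     "Loop."],
    toy_teacher,
)
```

In the full pipeline these mined pairs fine-tune the policy, and the loop repeats on fresh on-policy generations.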
♻ ☆ Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding
Yaoxian Song, Penglei Sun, Piaopiao Jin, Yi Ren, Yu Zheng, Zhixu Li, Xiaowen Chu, Yue Zhang, Tiefeng Li, Jason Gu
Robotic grasping is a fundamental ability for a robot to interact with the
environment. Current methods focus on how to obtain a stable and reliable
grasping pose at the object level, while little work has studied part
(shape)-wise grasping, which is related to fine-grained grasping and robotic
affordance. Parts can be seen as atomic elements to compose an object, which
contains rich semantic knowledge and a strong correlation with affordance.
However, the lack of a large part-wise 3D robotic dataset limits the development of
part representation learning and downstream applications. In this paper, we
propose a new large Language-guided SHape grAsPing datasEt (named LangSHAPE) to
promote 3D part-level affordance and grasping ability learning. From the
perspective of robotic cognition, we design a two-stage fine-grained robotic
grasping framework (named LangPartGPD), including a novel 3D part language
grounding model and a part-aware grasp pose detection model, in which explicit
language input from human or large language models (LLMs) could guide a robot
to generate part-level 6-DoF grasping pose with textual explanation. Our method
combines the advantages of human-robot collaboration and LLMs' planning ability
using explicit language as a symbolic intermediate. To evaluate the
effectiveness of our proposed method, we perform 3D part grounding and
fine-grained grasp detection experiments on both simulation and physical robot
settings, following language instructions across different degrees of textual
complexity. Results show our method achieves competitive performance in 3D
geometry fine-grained grounding, object affordance inference, and 3D part-aware
grasping tasks. Our dataset and code are available on our project website
https://sites.google.com/view/lang-shape
comment: 14 pages, 7 figures, 6 tables
♻ ☆ Scalable MatMul-free Language Modeling
Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
Matrix multiplication (MatMul) typically dominates the overall computational
cost of large language models (LLMs). This cost only grows as LLMs scale to
larger embedding dimensions and context lengths. In this work, we show that
MatMul operations can be completely eliminated from LLMs while maintaining
strong performance at billion-parameter scales. Our experiments show that our
proposed MatMul-free models achieve performance on-par with state-of-the-art
Transformers that require far more memory during inference at a scale up to at
least 2.7B parameters. We investigate the scaling laws and find that the
performance gap between our MatMul-free models and full precision Transformers
narrows as the model size increases. We also provide a GPU-efficient
implementation of this model which reduces memory usage by up to 61% over an
unoptimized baseline during training. By utilizing an optimized kernel during
inference, our model's memory consumption can be reduced by more than 10x
compared to unoptimized models. To properly quantify the efficiency of our
architecture, we build a custom hardware solution on an FPGA which exploits
lightweight operations beyond what GPUs are capable of. We processed
billion-parameter-scale models at 13 W, beyond human-readable throughput, moving
LLMs closer to brain-like efficiency. This work not only shows how far LLMs can
be stripped back while still performing effectively, but also points at the
types of operations future accelerators should be optimized for in processing
the next generation of lightweight LLMs. Our code implementation is available
at https://github.com/ridgerchu/matmulfreellm.
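Why eliminating MatMul is possible can be illustrated with ternary weights: when every weight is in {-1, 0, +1}, a dense layer reduces to signed additions. The naive NumPy rendition below shows only that equivalence; it is not the paper's architecture or its optimized kernels.

```python
# Illustrative sketch: with ternary weights, y = x @ W needs no
# multiplications, only additions and subtractions per output unit.
import numpy as np

def ternary_linear(x, w_ternary):
    """Compute x @ W for W in {-1, 0, +1} using only adds/subtracts."""
    out = np.zeros((x.shape[0], w_ternary.shape[1]))
    for j in range(w_ternary.shape[1]):
        col = w_ternary[:, j]
        # Add inputs where the weight is +1, subtract where it is -1.
        out[:, j] = x[:, col == 1].sum(axis=1) - x[:, col == -1].sum(axis=1)
    return out

x = np.array([[1.0, 2.0, 3.0]])
W = np.array([[1, 0], [-1, 1], [0, -1]])
y = ternary_linear(x, W)  # identical to x @ W
```

Addition-only arithmetic is exactly the kind of lightweight operation the FPGA implementation exploits.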
♻ ☆ Understanding Inter-Session Intentions via Complex Logical Reasoning
Understanding user intentions is essential for improving product
recommendations, navigation suggestions, and query reformulations. However,
user intentions can be intricate, involving multiple sessions and attribute
requirements connected by logical operators such as And, Or, and Not. For
instance, a user may search for Nike or Adidas running shoes across various
sessions, with a preference for purple. In another example, a user may have
purchased a mattress in a previous session and is now looking for a matching
bed frame without intending to buy another mattress. Existing research on
session understanding has not adequately addressed making product or attribute
recommendations for such complex intentions. In this paper, we present the task
of logical session complex query answering (LS-CQA), where sessions are treated
as hyperedges of items, and we frame the problem of complex intention
understanding as an LS-CQA task on an aggregated hypergraph of sessions, items,
and attributes. This is a unique complex query answering task with sessions as
ordered hyperedges. We also introduce a new model, the Logical Session Graph
Transformer (LSGT), which captures interactions among items across different
sessions and their logical connections using a transformer structure. We
analyze the expressiveness of LSGT and prove the permutation invariance of the
inputs for the logical operators. By evaluating LSGT on three datasets, we
demonstrate that it achieves state-of-the-art results.
♻ ☆ Cross-Subject Data Splitting for Brain-to-Text Decoding
Recent major milestones have successfully decoded non-invasive brain signals
(e.g. functional Magnetic Resonance Imaging (fMRI) and electroencephalogram
(EEG)) into natural language. Despite the progress in model design, how to
split datasets for training, validation, and testing remains a matter of
debate. Most prior research applied subject-specific data splitting, where the
decoding model is trained and evaluated per subject. Such a splitting method
limits both dataset utilization efficiency and model generalization. In this
study, we propose a cross-subject
data splitting criterion for brain-to-text decoding on various types of
cognitive datasets (fMRI, EEG), aiming to maximize dataset utilization and
improve model generalization. We undertake a comprehensive analysis on existing
cross-subject data splitting strategies and prove that all these methods
suffer from data leakage, namely the leakage of test data into the training
set, which leads to significant overfitting and overestimation of decoding
models. The
proposed cross-subject splitting method successfully addresses the data leakage
problem and we re-evaluate some SOTA brain-to-text decoding models as baselines
for further research.
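The leakage described above can be made concrete: when every subject is exposed to the same stimuli, holding out subjects alone still leaks the test text into training. A minimal Python sketch (the field names and helpers are hypothetical, not the authors' code):

```python
# Hypothetical sketch: hold out whole subjects, then check whether the test
# stimuli (e.g. sentences read in the scanner) were also seen in training.
import random

def subject_wise_split(samples, test_frac=0.2, seed=0):
    """Hold out whole subjects so no subject appears in both splits."""
    subjects = sorted({s["subject"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_frac))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test

def has_stimulus_leakage(train, test):
    """True if any test stimulus was also seen during training --
    the leakage the paper warns about."""
    seen = {s["stimulus"] for s in train}
    return any(s["stimulus"] in seen for s in test)
```

With shared stimuli across subjects, a subject-wise holdout passes the subject check yet still exhibits stimulus leakage, which is exactly the failure mode the proposed criterion is designed to rule out.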
♻ ☆ On Context Utilization in Summarization with Large Language Models ACL 2024
Large language models (LLMs) excel in abstractive summarization tasks,
delivering fluent and pertinent summaries. Recent advancements have extended
their capabilities to handle long-input contexts, exceeding 100k tokens.
However, in question answering, language models exhibit uneven utilization of
their input context. They tend to favor the initial and final segments,
resulting in a U-shaped performance pattern concerning where the answer is
located within the input. This bias raises concerns, particularly in
summarization where crucial content may be dispersed throughout the source
document(s). Moreover, mapping facts from the source to the summary is not
trivial in summarization, as salient content is usually rephrased. In this paper,
we conduct the first comprehensive study on context utilization and position
bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5
evaluation metrics. We introduce a new evaluation benchmark called MiddleSum,
on which we benchmark two alternative inference methods to alleviate position
bias: hierarchical summarization and incremental summarization. Our code and
data can be found here: https://github.com/ntunlp/MiddleSum.
comment: ACL 2024. 9 pages, 7 figures, 3 tables
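Hierarchical summarization, one of the two inference methods benchmarked against position bias, can be sketched as chunk-then-fuse; the `summarize` stub below stands in for an LLM call and is purely illustrative:

```python
def summarize(text: str, max_words: int = 50) -> str:
    """Placeholder for an LLM summarization call: here, naive truncation."""
    return " ".join(text.split()[:max_words])

def hierarchical_summarize(document: str, chunk_words: int = 500) -> str:
    """Summarize fixed-size chunks, then summarize the concatenated chunk
    summaries, so no source region is systematically stuck 'in the middle'."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partial = [summarize(chunk) for chunk in chunks]  # per-chunk summaries
    return summarize(" ".join(partial))               # fuse into one summary
```

Because each chunk is short, the model's U-shaped attention over any single call covers the whole chunk, mitigating the position bias the paper measures.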
♻ ☆ StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, Yang Liu
Large Language Models (LLMs) have witnessed remarkable advancements in recent
years, prompting the exploration of tool learning, which integrates LLMs with
external tools to address diverse real-world challenges. Assessing the
capability of LLMs to utilise tools necessitates large-scale and stable
benchmarks. However, previous works relied on either hand-crafted online tools
with limited scale, or large-scale real online APIs suffering from instability
of API status. To address this problem, we introduce StableToolBench, a
benchmark evolving from ToolBench, proposing a virtual API server and stable
evaluation system. The virtual API server contains a caching system and API
simulators, which complement each other to mitigate changes in API status.
Meanwhile, the stable evaluation system designs solvable pass and win rates
using GPT-4 as the automatic evaluator to eliminate the randomness during
evaluation. Experimental results demonstrate the stability of StableToolBench,
and further discuss the effectiveness of API simulators, the caching system,
and the evaluator system.
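The cache-plus-simulator idea behind the virtual API server can be sketched as a simple lookup-then-fallback layer (class and method names here are hypothetical, not the benchmark's actual interface):

```python
class VirtualAPIServer:
    """Sketch of the cache-plus-simulator idea: serve a cached response when
    one exists, otherwise fall back to a simulator (any callable)."""

    def __init__(self, simulator):
        self.cache = {}
        self.simulator = simulator

    def call(self, api_name, args):
        # Hashable key from the API name and its sorted arguments.
        key = (api_name, tuple(sorted(args.items())))
        if key not in self.cache:  # cache miss: ask the simulator once
            self.cache[key] = self.simulator(api_name, args)
        return self.cache[key]
```

Repeated identical calls then return a stable response regardless of the live API's status, which is the source of the benchmark's stability.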
♻ ☆ FusionBench: A Comprehensive Benchmark of Deep Model Fusion
Deep model fusion is an emerging technique that unifies the predictions or
parameters of several deep neural networks into a single model in a
cost-effective and data-efficient manner. This enables the unified model to
take advantage of the original models' strengths, potentially exceeding their
performance. Although a variety of deep model fusion techniques have been
introduced, their evaluations tend to be inconsistent and often inadequate to
validate their effectiveness and robustness against distribution shifts. To
address this issue, we introduce FusionBench, which is the first comprehensive
benchmark dedicated to deep model fusion. FusionBench covers a wide range of
tasks, including open-vocabulary image classification, text classification, and
text-to-text generation. Each category includes up to eight tasks with
corresponding task-specific models, featuring both full fine-tuning and LoRA
fine-tuning, as well as models of different sizes, to ensure fair and balanced
comparisons of various multi-task model fusion techniques across different
tasks, model scales, and fine-tuning strategies. We implement and evaluate a
broad spectrum of deep model fusion techniques. These techniques range from
model ensemble methods, which combine the predictions to improve the overall
performance, to model merging, which integrates different models into a single
one, and model mixing methods, which upscale or recombine the components of the
original models. FusionBench now contains 26 distinct tasks, 74 fine-tuned
models, and 16 fusion techniques, and we are committed to consistently
expanding the benchmark with more tasks, models, and fusion techniques. In
addition, we offer a well-documented set of resources and guidelines to aid
researchers in understanding and replicating the benchmark results. Homepage:
https://github.com/tanganke/fusion_bench
comment: Project homepage: https://github.com/tanganke/fusion_bench
♻ ☆ Unsupervised extraction of local and global keywords from a single text
We propose an unsupervised, corpus-independent method to extract keywords
from a single text. It is based on the spatial distribution of words and the
response of this distribution to a random permutation of words. Compared to
existing methods (e.g., YAKE), our method has three advantages. First, it
is significantly more effective at extracting keywords from long texts. Second,
it allows inference of two types of keywords: local and global. Third, it
uncovers basic themes in texts. Additionally, our method is
language-independent and applies to short texts. The results are obtained via
human annotators with prior knowledge of the texts from our database of
classical literary works (inter-annotator agreement ranges from moderate to
substantial). Our results are further supported by human-independent arguments based
on the average length of extracted content words and on the average number of
nouns in extracted words. We discuss relations of keywords with higher-order
textual features and reveal a connection between keywords and chapter
divisions.
comment: 10 pages, 1 figure
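The permutation idea can be illustrated with a toy statistic: compare the spread of a word's occurrence gaps in the original text against its average spread after random shuffles. Content words tend to cluster (a mix of small within-cluster and large between-cluster gaps), while evenly used words do not. This is an illustrative score, not the paper's exact measure:

```python
import random
import statistics

def clustering_score(tokens, word, n_shuffles=20, seed=0):
    """Ratio of the word's gap spread in the text to its average gap spread
    after random permutations; values above 1 suggest spatial clustering."""
    def gap_std(seq):
        pos = [i for i, t in enumerate(seq) if t == word]
        if len(pos) < 3:
            return 0.0
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        return statistics.pstdev(gaps)

    observed = gap_std(tokens)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_shuffles):
        perm = list(tokens)
        rng.shuffle(perm)
        total += gap_std(perm)
    baseline = (total / n_shuffles) or 1.0  # guard against division by zero
    return observed / baseline
```

A word scattered at regular intervals scores near zero, while a word that bunches into a few passages scores above its shuffled baseline, which is the signal exploited for keyword extraction.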
♻ ☆ Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling ACL 2024
Hang Jiang, Xiajie Zhang, Robert Mahari, Daniel Kessler, Eric Ma, Tal August, Irene Li, Alex 'Sandy' Pentland, Yoon Kim, Jad Kabbara, Deb Roy
Making legal knowledge accessible to non-experts is crucial for enhancing
general legal literacy and encouraging civic participation in democracy.
However, legal documents are often challenging to understand for people without
legal backgrounds. In this paper, we present a novel application of large
language models (LLMs) in legal education to help non-experts learn intricate
legal concepts through storytelling, an effective pedagogical tool in conveying
complex and abstract concepts. We also introduce a new dataset LegalStories,
which consists of 294 complex legal doctrines, each accompanied by a story and
a set of multiple-choice questions generated by LLMs. To construct the dataset,
we experiment with various LLMs to generate legal stories explaining these
concepts. Furthermore, we use an expert-in-the-loop approach to iteratively
design multiple-choice questions. Then, we evaluate the effectiveness of
storytelling with LLMs through randomized controlled trials (RCTs) with legal
novices on 10 samples from the dataset. We find that LLM-generated stories
enhance non-native speakers' comprehension of legal concepts and interest in
law compared to definitions alone. Moreover, stories consistently help
participants relate legal concepts to their lives. Finally, we find that
learning with stories shows a higher retention rate for non-native speakers in
the follow-up assessment. Our work has strong implications for using LLMs in
promoting teaching and learning in the legal field and beyond.
comment: Accepted to ACL 2024
♻ ☆ Self-Play Preference Optimization for Language Model Alignment
Traditional reinforcement learning from human feedback (RLHF) approaches
relying on parametric models like the Bradley-Terry model fall short in
capturing the intransitivity and irrationality in human preferences. Recent
advancements suggest that directly working with preference probabilities can
yield a more accurate reflection of human preferences, enabling more flexible
and accurate language model alignment. In this paper, we propose a
self-play-based method for language model alignment, which treats the problem
as a constant-sum two-player game aimed at identifying the Nash equilibrium
policy. Our approach, dubbed Self-Play Preference Optimization (SPPO),
approximates the Nash equilibrium through iterative policy updates and enjoys a
theoretical convergence guarantee. Our method can effectively increase the
log-likelihood of the chosen response and decrease that of the rejected
response, which cannot be trivially achieved by symmetric pairwise loss such as
Direct Preference Optimization (DPO) and Identity Preference Optimization
(IPO). In our experiments, using only 60k prompts (without responses) from the
UltraFeedback dataset and without any prompt augmentation, by leveraging a
pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain
a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the
state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on
AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and
the Open LLM Leaderboard. Starting from a stronger base model
Llama-3-8B-Instruct, we are able to achieve a length-controlled win rate of
38.77%. Notably, the strong performance of SPPO is achieved without additional
external supervision (e.g., responses, preferences, etc.) from GPT-4 or other
stronger language models. Codes are available at
https://github.com/uclaml/SPPO.
comment: 27 pages, 4 figures, 5 tables
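The SPPO update can be read as a squared-error regression of the log-probability ratio onto the centered win probability. A minimal sketch under that reading, with `eta` and the win-probability estimator left as training choices rather than fixed by this digest:

```python
def sppo_loss(logp_theta: float, logp_ref: float, win_prob: float,
              eta: float = 1e3) -> float:
    """One-response SPPO-style objective (sketch): push the log-probability
    ratio log(pi_theta / pi_ref) toward eta * (win probability - 1/2), so
    responses that beat the current policy gain probability mass."""
    target = eta * (win_prob - 0.5)
    return (logp_theta - logp_ref - target) ** 2
```

Unlike a pairwise loss such as DPO, each response has its own regression target, so the chosen response's log-likelihood can rise while the rejected one's falls, as the abstract notes.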
♻ ☆ RDRec: Rationale Distillation for LLM-based Recommendation ACL 2024
Large language model (LLM)-based recommender models that bridge users and
items through textual prompts for effective semantic reasoning have gained
considerable attention. However, few methods consider the underlying rationales
behind interactions, such as user preferences and item attributes, limiting the
reasoning capability of LLMs for recommendations. This paper proposes a
rationale distillation recommender (RDRec), a compact model designed to learn
rationales generated by a larger language model (LM). By leveraging rationales
from reviews related to users and items, RDRec remarkably specifies their
profiles for recommendations. Experiments show that RDRec achieves
state-of-the-art (SOTA) performance in both top-N and sequential
recommendations. Our source code is released at
https://github.com/WangXFng/RDRec.
comment: 10 pages. Accepted to ACL 2024 Main as a short paper
♻ ☆ L^2GC: Lorentzian Linear Graph Convolutional Networks for Node Classification LREC
Linear Graph Convolutional Networks (GCNs) are used to classify nodes in
graph data. However, we note that most existing linear GCN models perform
neural network operations in Euclidean space, which does not explicitly
capture the tree-like hierarchical structure exhibited in real-world datasets
that are modeled as graphs. In this paper, we introduce hyperbolic space into
linear GCN and propose a novel framework for Lorentzian linear GCN.
Specifically, we map the learned features of graph nodes into hyperbolic space,
and then perform a Lorentzian linear feature transformation to capture the
underlying tree-like structure of data. Experimental results on standard
citation networks datasets with semi-supervised learning show that our approach
yields new state-of-the-art results of accuracy 74.7$\%$ on Citeseer and
81.3$\%$ on PubMed datasets. Furthermore, we observe that our approach can be
trained up to two orders of magnitude faster than other nonlinear GCN models on
PubMed dataset. Our code is publicly available at
https://github.com/llqy123/LLGC-master.
comment: Accepted by LREC-COLING 2024
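Mapping Euclidean node features into the Lorentz (hyperboloid) model is typically done with the exponential map at the origin; a standard curvature -1 sketch (the paper's exact parameterization may differ):

```python
import math

def lorentz_expmap0(v, eps=1e-9):
    """Exponential map at the origin of the Lorentz model (curvature -1):
    lifts a Euclidean vector v onto the hyperboloid -t^2 + |s|^2 = -1,
    where t is the time-like coordinate and s the space-like part."""
    norm = math.sqrt(sum(x * x for x in v)) + eps  # eps avoids division by 0
    time = math.cosh(norm)
    space = [math.sinh(norm) * x / norm for x in v]
    return [time] + space
```

Points produced this way satisfy the hyperboloid constraint, after which a Lorentzian linear transformation can operate on the lifted features.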
♻ ☆ Sunnie: An Anthropomorphic LLM-Based Conversational Agent for Mental Well-Being Activity Recommendation
A longstanding challenge in mental well-being support is the reluctance of
people to adopt psychologically beneficial activities, often due to lack of
motivation, low perceived trustworthiness, and limited personalization of
recommendations. Chatbots have shown promise in promoting positive mental
health practices, yet their rigid interaction flows and less human-like
conversational experiences present significant limitations. In this work, we
explore whether the anthropomorphic design (both LLM's persona design and
conversational experience design) can enhance users' perception of the system
and their willingness to adopt mental well-being activity recommendations. To
this end, we introduce Sunnie, an anthropomorphic LLM-based conversational
agent designed to offer personalized well-being support through multi-turn
conversation and recommend practical actions grounded in positive psychology
and social psychology. An empirical user study comparing the user experience
with Sunnie and with a traditional survey-based activity recommendation system
suggests that the anthropomorphic characteristics of Sunnie significantly
enhance users' perception of the system and the overall usability;
nevertheless, users' willingness to adopt activity recommendations did not
change significantly.
comment: In Submission
♻ ☆ SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning LREC
Contrastive learning has recently achieved compelling performance in
unsupervised sentence representation. As an essential element, data
augmentation protocols, however, have not been well explored. The pioneering
work SimCSE resorting to a simple dropout mechanism (viewed as continuous
augmentation) surprisingly dominates discrete augmentations such as cropping,
word deletion, and synonym replacement as reported. To understand the
underlying rationales, we revisit existing approaches and attempt to
hypothesize the desiderata of reasonable data augmentation methods: balance of
semantic consistency and expression diversity. We then develop three simple yet
effective discrete sentence augmentation schemes: punctuation insertion, modal
verbs, and double negation. They act as minimal noises at lexical level to
produce diverse forms of sentences. Furthermore, standard negation is
capitalized on to generate negative samples for alleviating feature suppression
involved in contrastive learning. We experimented extensively with semantic
textual similarity on diverse datasets. The results support the superiority of
the proposed methods consistently. Our key code is available at
https://github.com/Zhudongsheng75/SDA
comment: Accepted by LREC-COLING 2024
♻ ☆ GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives NAACL 2024
Vinodkumar Prabhakaran, Christopher Homan, Lora Aroyo, Aida Mostafazadeh Davani, Alicia Parrish, Alex Taylor, Mark Díaz, Ding Wang, Gregory Serapio-García
Human annotation plays a core role in machine learning -- annotations for
supervised models, safety guardrails for generative models, and human feedback
for reinforcement learning, to cite a few avenues. However, the fact that many
of these human annotations are inherently subjective is often overlooked.
Recent work has demonstrated that ignoring rater subjectivity (typically
resulting in rater disagreement) is problematic within specific tasks and for
specific subgroups. Generalizable methods to harness rater disagreement and
thus understand the socio-cultural leanings of subjective tasks remain elusive.
In this paper, we propose GRASP, a comprehensive disagreement analysis
framework to measure group association in perspectives among different rater
sub-groups, and demonstrate its utility in assessing the extent of systematic
disagreements in two datasets: (1) safety annotations of human-chatbot
conversations, and (2) offensiveness annotations of social media posts, both
annotated by diverse rater pools across different socio-demographic axes. Our
framework (based on disagreement metrics) reveals specific rater groups that
have significantly different perspectives than others on certain tasks, and
helps identify demographic axes that are crucial to consider in specific task
contexts.
comment: Presented as a long paper at NAACL 2024 main conference
♻ ☆ Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
Chong Ma, Hanqi Jiang, Wenting Chen, Yiwei Li, Zihao Wu, Xiaowei Yu, Zhengliang Liu, Lei Guo, Dajiang Zhu, Tuo Zhang, Dinggang Shen, Tianming Liu, Xiang Li
In medical multi-modal frameworks, the alignment of cross-modality features
presents a significant challenge. However, existing works have learned
features that are implicitly aligned from the data, without considering the
explicit relationships in the medical context. This reliance on data may lead
to poor generalization of the learned alignment relationships. In this work, we
propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework to harness
eye-gaze data for better alignment of medical visual and textual features. We
explore the natural auxiliary role of radiologists' eye-gaze data in aligning
medical images and text, and introduce a novel approach by using eye-gaze data,
collected synchronously by radiologists during diagnostic evaluations. We
conduct downstream tasks of image classification and image-text retrieval on
four medical datasets, where EGMA achieved state-of-the-art performance and
stronger generalization across different datasets. Additionally, we explore the
impact of varying amounts of eye-gaze data on model performance, highlighting
the feasibility and utility of integrating this auxiliary data into multi-modal
alignment framework.
comment: 12 pages, 6 figures
♻ ☆ MORE: Multi-mOdal REtrieval Augmented Generative Commonsense Reasoning ACL
Since commonsense information is recorded far less often than it occurs,
language models pre-trained on text generation have difficulty learning
sufficient commonsense knowledge. Several studies have leveraged text
retrieval to augment the models' commonsense ability. Unlike text, images
capture commonsense information inherently, but little effort has been made to
utilize them effectively. In this work, we propose a novel
Multi-mOdal REtrieval (MORE) augmentation framework, to leverage both text and
images to enhance the commonsense ability of language models. Extensive
experiments on the Common-Gen task have demonstrated the efficacy of MORE based
on the pre-trained models of both single and multiple modalities.
comment: Published as a conference paper at ACL Findings 2024
♻ ☆ Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores
The kNN-CTC model has proven to be effective for monolingual automatic speech
recognition (ASR). However, its direct application to multilingual scenarios
such as code-switching presents challenges. Although there is potential for
performance improvement, a kNN-CTC model utilizing a single bilingual datastore
can inadvertently introduce undesirable noise from the alternative language. To
address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR)
framework that employs dual monolingual datastores and a gated datastore
selection mechanism to reduce noise interference. Our method selects the
appropriate datastore for decoding each frame, ensuring the injection of
language-specific information into the ASR process. We apply this framework to
cutting-edge CTC-based models, developing an advanced CS-ASR system. Extensive
experiments demonstrate the remarkable effectiveness of our gated datastore
mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.
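The per-frame gating can be sketched as routing each acoustic frame to whichever monolingual datastore holds its nearer neighbour (a 1-nearest-neighbour toy version; a real kNN-CTC system would combine the retrieved distribution with the CTC posterior rather than return a hard label):

```python
import math

def l2(a, b):
    """Euclidean distance between two frame embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gated_knn_label(frame, zh_store, en_store):
    """Gate between two monolingual datastores, each a list of
    (embedding, label) pairs: pick the store with the nearer neighbour so
    the other language's entries cannot inject noise into this frame."""
    zh_best = min(zh_store, key=lambda e: l2(frame, e[0]))
    en_best = min(en_store, key=lambda e: l2(frame, e[0]))
    if l2(frame, zh_best[0]) <= l2(frame, en_best[0]):
        return zh_best[1]
    return en_best[1]
```

The gate is what keeps a Mandarin frame from retrieving English tokens (and vice versa), which a single bilingual datastore cannot guarantee.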
♻ ☆ Are EEG-to-Text Models Working?
This work critically analyzes existing models for open-vocabulary EEG-to-Text
translation. We identify a crucial limitation: previous studies often employed
implicit teacher-forcing during evaluation, artificially inflating performance
metrics. Additionally, they lacked a critical benchmark - comparing model
performance on pure noise inputs. We propose a methodology to differentiate
between models that truly learn from EEG signals and those that simply memorize
training data. Our analysis reveals that model performance on noise data can be
comparable to that on EEG data. These findings highlight the need for stricter
evaluation practices in EEG-to-Text research, emphasizing transparent reporting
and rigorous benchmarking with noise inputs. This approach will lead to more
reliable assessments of model capabilities and pave the way for robust
EEG-to-Text communication systems.
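The noise control the authors advocate is easy to add to any evaluation loop: score the model on random inputs shaped like the EEG batch and compare (the `model_eval` scoring callable below is an assumed interface, not any specific model's API):

```python
import random

def noise_control(model_eval, eeg_batch, seed=0):
    """Score `model_eval` on the real batch and on Gaussian noise of the
    same shape; comparable scores suggest the model is not actually
    reading the EEG signal, only memorizing its training targets."""
    rng = random.Random(seed)
    noise_batch = [[rng.gauss(0.0, 1.0) for _ in trial] for trial in eeg_batch]
    return {"eeg_score": model_eval(eeg_batch),
            "noise_score": model_eval(noise_batch)}
```

A large gap between the two scores is necessary (though not sufficient) evidence that the decoder depends on the brain signal.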
♻ ☆ AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
Evaluating the alignment capabilities of large Vision-Language Models (VLMs)
is essential for determining their effectiveness as helpful assistants.
However, existing benchmarks primarily focus on basic abilities using nonverbal
methods, such as yes-no and multiple-choice questions. In this paper, we
address this gap by introducing AlignMMBench, a comprehensive alignment
benchmark specifically designed for emerging Chinese VLMs. This benchmark is
meticulously curated from real-world scenarios and Chinese Internet sources,
encompassing thirteen specific tasks across three categories, and includes both
single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite
strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer
pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a
rule-calibrated evaluator that exceeds GPT-4's evaluation ability. Finally, we
report the performance of representative VLMs on AlignMMBench, offering
insights into the capabilities and limitations of different VLM architectures.
All evaluation codes and data are available on https://alignmmbench.github.io.
♻ ☆ TimeCMA: Towards LLM-Empowered Time Series Forecasting via Cross-Modality Alignment
The widespread adoption of scalable mobile sensing has led to large amounts
of time series data for real-world applications. A fundamental application is
multivariate time series forecasting (MTSF), which aims to predict future time
series values based on historical observations. Existing MTSF methods suffer
from limited parameterization and small-scale training data. Recently, Large
language models (LLMs) have been introduced in time series, which achieve
promising forecasting performance but incur heavy computational costs. To solve
these challenges, we propose TimeCMA, an LLM-empowered framework for time
series forecasting with cross-modality alignment. We design a dual-modality
encoding module with two branches, where the time series encoding branch
extracts relatively low-quality yet pure embeddings of time series through an
inverted Transformer. In addition, the LLM-empowered encoding branch wraps the
same time series as prompts to obtain high-quality yet entangled prompt
embeddings via a Pre-trained LLM. Then, we design a cross-modality alignment
module to retrieve high-quality and pure time series embeddings from the prompt
embeddings. Moreover, we develop a time series forecasting module to decode the
aligned embeddings while capturing dependencies among multiple variables for
forecasting. Notably, we tailor the prompt to encode sufficient temporal
information into a last token and design the last token embedding storage to
reduce computational costs. Extensive experiments on real data offer insight
into the accuracy and efficiency of the proposed framework.
♻ ☆ VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild ACL 2024
We introduce VoiceCraft, a token infilling neural codec language model, that
achieves state-of-the-art performance on both speech editing and zero-shot
text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft
employs a Transformer decoder architecture and introduces a token rearrangement
procedure that combines causal masking and delayed stacking to enable
generation within an existing sequence. On speech editing tasks, VoiceCraft
produces edited speech that is nearly indistinguishable from unedited
recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS,
our model outperforms prior SotA models including VALLE and the popular
commercial model XTTS-v2. Crucially, the models are evaluated on challenging
and realistic datasets that consist of diverse accents, speaking styles,
recording conditions, and background noise and music, and our model performs
consistently well compared to other models and real recordings. In particular,
for speech editing evaluation, we introduce a high quality, challenging, and
realistic dataset named RealEdit. We encourage readers to listen to the demos
at https://jasonppy.github.io/VoiceCraft_web.
comment: ACL 2024. Data, code, and model weights are available at
https://github.com/jasonppy/VoiceCraft